Lecture 7

• Principal Component Analysis
  – Dimensionality reduction via feature extraction. The emphasis is on representing the original signal as accurately as possible in the lower-dimensional space.

• Linear Discriminant Analysis
  – Dimensionality reduction via feature extraction. The emphasis is on enhancing the class discrimination in the lower-dimensional space.
Principal Components Analysis
• The curse of dimensionality
The curse of dimensionality

• The curse of dimensionality
  – A term coined by Richard Bellman in 1961.
  – Refers to the problems associated with multivariate data analysis as the dimensionality increases.
Consider a 3-class pattern recognition problem
• A simple approach would be to
– Divide the feature space into uniform bins
– Compute the ratio of examples for each class at each bin and,
– For a new example, find its bin and choose the predominant class in that bin
• In our toy problem we decide to start with one single feature and divide the real line into 3 segments
Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University
The curse of dimensionality (1)

• After doing this, we notice that there exists too much overlap among the classes, so we decide to incorporate a second feature to try and improve separability
• We decide to preserve the granularity of each axis, which raises the number of bins from 3 (in 1D) to 3² = 9 (in 2D)

• We need to make a decision: maintain the density of examples per bin, or keep the number of examples as for the 1D case?
  – Maintaining the density increases the number of examples from 9 (in 1D) to 27 (in 2D)
  – Maintaining the number of examples results in a very sparse 2D scatter plot
[Figure: The curse of dimensionality (2) — bin layouts over x1, x2, x3, comparing constant density against a constant number of examples.]
Moving to three features makes the problem worse:

• The number of bins grows to 3³ = 27
• For the same density of examples, the number of needed examples becomes 81
• For the same number of examples, the 3D scatter plot is almost empty
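The bin and example counts above are simple arithmetic; a quick sketch (the function names are illustrative, not from the lecture):

```python
def bins(granularity, dims):
    """Number of uniform bins when each axis is split into `granularity` segments."""
    return granularity ** dims

def examples_for_density(density, granularity, dims):
    """Examples needed to keep `density` examples per bin."""
    return density * bins(granularity, dims)

# 3 segments per axis, 3 examples per bin (the 9-example 1D case):
# 1D -> 3 bins, 2D -> 9 bins, 3D -> 27 bins, needing 9, 27, 81 examples
```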
The curse of dimensionality

• This approach of dividing the sample space into equally spaced bins was quite inefficient
  – There are other approaches that are much less susceptible to the curse of dimensionality, but the problem still exists.

• How do we beat the curse of dimensionality?
  – By incorporating prior knowledge
  – By providing increasing smoothness of the target function
  – By reducing the dimensionality

• In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve
  – In most cases, the additional information that is lost by discarding some features is (more than) compensated by a more accurate mapping in the lower-dimensional space
The curse of dimensionality

There are many implications of the curse of dimensionality

• Exponential growth in the number of examples required to maintain a given sampling density
  – For a density of N examples/bin and D dimensions, the total number of examples is N^D

• Exponential growth in the complexity of the target function (a density estimate) with increasing dimensionality
  – “A function defined in high-dimensional space is likely to be much more complex than a function defined in a lower-dimensional space, and those complications are harder to discern” – Friedman

This means that, in order to learn it well, a more complex target function requires denser sample points!
• What to do if it isn’t Gaussian?
  – For one dimension a large number of density functions can be found in textbooks, but for high dimensions only the multivariate Gaussian density is available. Moreover, for larger values of D the Gaussian density can only be handled in a simplified form!

• Humans have an extraordinary capacity to discern patterns and clusters in 1, 2 and 3 dimensions, but these capabilities degrade drastically for 4 or more dimensions.
Principal Components Analysis

• The curse of dimensionality
• Dimensionality reduction & feature selection vs. feature extraction

Dimensionality reduction (1)

Two approaches to dimensionality reduction

• Feature extraction: create a subset of new features by combinations of the existing features
• Feature selection: choose a subset of all the features (the more informative ones)
  (x_1, x_2, ..., x_N)^T → (x_{i1}, x_{i2}, ..., x_{iM})^T        (feature selection)

  (x_1, x_2, ..., x_N)^T → (y_1, y_2, ..., y_M)^T = f(x_1, x_2, ..., x_N)^T        (feature extraction)
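The two mappings can be contrasted in a few lines of numpy; the index set and the matrix W below are arbitrary illustrations, not taken from the lecture:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5, 3.0])      # N = 4 original features

# Feature selection: keep a subset of the original coordinates
idx = [0, 3]                             # hypothetical "informative" features
x_sel = x[idx]                           # M = 2

# Feature extraction: build new features as combinations of all of them
W = np.array([[0.5, 0.5, 0.0, 0.0],      # hypothetical linear map f(x) = W x
              [0.0, 0.0, 0.5, 0.5]])
y = W @ x                                # M = 2 extracted features
```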
The problem of feature extraction can be stated as

• Given a feature space x ∈ R^N, find a mapping

  y = f(x) : R^N → R^M with M < N

  such that the transformed feature vector y ∈ R^M preserves (most of) the information or structure in R^N.

• An optimal mapping y = f(x) is one that results in no increase in the minimum probability of error.
  – That is, a Bayes decision rule applied to the initial space R^N and to the reduced space R^M yields the same classification rate.
Dimensionality reduction (2)

Generally, the optimal mapping y = f(x) is a non-linear function

• However, there is no systematic way to generate non-linear transforms
  – The selection of a particular subset of transforms is problem dependent

• For this reason, feature extraction is commonly limited to linear transforms: y = Wx
  – That is, y is a linear projection of x
  – When the mapping is a non-linear function, the reduced space is called a manifold

  (y_1, y_2, ..., y_M)^T = W (x_1, x_2, ..., x_N)^T,   where W is the M × N matrix

  W = ( w_11  w_12  · · ·  w_1N
        w_21  w_22  · · ·  w_2N
          ⋮     ⋮     ⋱     ⋮
        w_M1  w_M2  · · ·  w_MN )
Principal Components Analysis
• The curse of dimensionality
• Dimensionality reduction
• Feature selection vs. feature extraction
• Signal representation vs. signal classification
Signal Representation vs. Classification

• The selection of the feature extraction mapping y = f(x) is guided by an objective function that we seek to maximize (or minimize)

• Depending on the criteria used by the objective function, feature extraction techniques are grouped into two categories:
  – Signal representation: the goal of the feature extraction mapping is to represent the samples accurately in a lower-dimensional space.
  – Classification: the goal of the feature extraction mapping is to enhance the class-discriminatory information in the lower-dimensional space.
[Figure: a two-class scatter plot over Feature 1 and Feature 2; the signal-representation direction (maximum variance) differs from the classification direction (best class separation).]
• Within the realm of linear feature extraction, two techniques are commonly used
  – Principal Components Analysis (PCA): uses a signal representation criterion
  – Linear Discriminant Analysis (LDA): uses a signal classification criterion
Principal Components Analysis
• The curse of dimensionality
• Dimensionality reduction
• Feature selection vs. feature extraction
• Signal representation vs. signal classification
• Principal Components Analysis
Intuitive Motivation

We want to encode as accurately as possible the position of the m points in this cluster. We can do so exactly with their x, y coordinate locations, of which there are 2m. However, say we only have bandwidth to record m + 4 numbers. Intuitively, what should these numbers represent?

Intuitive Motivation

Devote 2 numbers to the center of mass of the points.

Intuitive Motivation

Use 2 numbers to define the direction in which there is most variation.

Intuitive Motivation

Let the other m numbers represent the distance of each point, projected onto that line, from the centre point.
Principal Components Analysis, PCA

The objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible

• Let x be an N-dimensional random vector, represented as a linear combination of orthonormal basis vectors (φ_1, φ_2, · · · , φ_N) as

  x = Σ_{i=1}^N y_i φ_i,   where φ_i^T φ_j = δ_ij

• Suppose we choose to represent x with only M (M < N) of the basis vectors. We do this by replacing the components (y_{M+1}, · · · , y_N)^T with some pre-selected constants b_i:

  x̂(M) = Σ_{i=1}^M y_i φ_i + Σ_{i=M+1}^N b_i φ_i

• The representation error is then

  Δx(M) = x − x̂(M) = Σ_{i=1}^N y_i φ_i − ( Σ_{i=1}^M y_i φ_i + Σ_{i=M+1}^N b_i φ_i ) = Σ_{i=M+1}^N (y_i − b_i) φ_i

• We can measure this representation error by the mean-squared magnitude of Δx.

• Our goal is to find the basis vectors φ_i and constants b_i that minimize this mean-square error

  ε²(M) = E[ |Δx(M)|² ]
        = E[ Σ_{i=M+1}^N Σ_{j=M+1}^N (y_i − b_i)(y_j − b_j) φ_i^T φ_j ]
        = Σ_{i=M+1}^N E[ (y_i − b_i)² ],   as the φ_i’s are orthonormal
PCA (2)

• The optimal values of b_i can be found by computing the partial derivative of the objective function and setting it to zero

  ∂/∂b_i E[ (y_i − b_i)² ] = −2 (E[y_i] − b_i) = 0  =⇒  b_i = E[y_i]

  – Therefore, we will replace the discarded dimensions y_i by their expected value

• The mean-square error can then be written as (using y_i = φ_i^T x)

  ε²(M) = Σ_{i=M+1}^N E[ (y_i − E[y_i])² ]
        = Σ_{i=M+1}^N E[ (φ_i^T x − E[φ_i^T x]) (φ_i^T x − E[φ_i^T x])^T ]
        = Σ_{i=M+1}^N φ_i^T E[ (x − E[x]) (x − E[x])^T ] φ_i
        = Σ_{i=M+1}^N φ_i^T Σ_x φ_i
• We seek the solution that minimizes this expression subject to the orthonormality constraint, which we incorporate into the expression using a set of Lagrange multipliers λ_i

  ε²(M) = Σ_{i=M+1}^N φ_i^T Σ_x φ_i + Σ_{i=M+1}^N λ_i (1 − φ_i^T φ_i)

• Computing the partial derivative with respect to the basis vectors

  ∂ε²(M)/∂φ_i = ∂/∂φ_i [ Σ_{i=M+1}^N φ_i^T Σ_x φ_i + Σ_{i=M+1}^N λ_i (1 − φ_i^T φ_i) ]
              = 2 (Σ_x φ_i − λ_i φ_i),   as ∂(x^T A x)/∂x = (A + A^T) x
              = 0
  =⇒  Σ_x φ_i = λ_i φ_i

So the φ_i and λ_i are the eigenvectors and eigenvalues of the covariance matrix Σ_x.
PCA (3)

• We can then express the sum-square error as

  ε²(M) = Σ_{i=M+1}^N φ_i^T Σ_x φ_i = Σ_{i=M+1}^N φ_i^T λ_i φ_i = Σ_{i=M+1}^N λ_i

• To minimize this measure, choose the λ_i to be the smallest eigenvalues.

Therefore, to represent x with minimum sum-square error, choose the eigenvectors φ_i corresponding to the largest eigenvalues λ_i.

PCA Dimensionality Reduction

The optimal approximation of a random vector x ∈ R^N by a linear combination of M (M < N) independent vectors is obtained by projecting the random vector x onto the eigenvectors φ_i corresponding to the largest eigenvalues λ_i of the covariance matrix Σ_x.
Basic PCA - Implementation Guide

• Given m data points x_i, each of dimension N.
• Compute the mean µ = (1/m) Σ_{i=1}^m x_i and subtract it from each data point: x_i^c = x_i − µ
• Compute the data matrix X where each column is a centered data point x_i^c.
• Compute the covariance matrix Σ = (1/m) X X^T.
• Find the eigenvectors and eigenvalues of Σ.
• The principal components are the k eigenvectors with the highest eigenvalues.
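The steps above can be sketched in a few lines of numpy; this is a minimal sketch, and the function name and the one-point-per-column data layout are our choices:

```python
import numpy as np

def pca(X, k):
    """Basic PCA following the steps above.

    X: N x m data matrix, one data point per column.
    Returns (mu, components): mu is the N x 1 mean, components is N x k with
    the top-k eigenvectors of the biased covariance as its columns."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                              # subtract the mean from each point
    m = X.shape[1]
    Sigma = (Xc @ Xc.T) / m                  # Sigma = (1/m) X X^T (biased)
    vals, vecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    return mu, vecs[:, order[:k]]
```

On the eight points of the worked 2D example later in these notes, this recovers a first principal direction of about (0.81, 0.59), up to sign.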
Remember SVD

Any arbitrary m × n matrix X can be converted to the product of an orthogonal matrix, a diagonal matrix and another orthogonal matrix via singular value decomposition:

  X = U S V^T

where

  U ∼ a square (m × m) orthogonal matrix
  S ∼ a diagonal (m × n) matrix containing the singular values
  V ∼ a square (n × n) orthogonal matrix
SVD PCA - Implementation Guide

• Given m data points x_i, each of dimension N.
• Compute the mean µ = (1/m) Σ_{i=1}^m x_i and subtract it from each data point: x_i^c = x_i − µ
• Compute the data matrix X where each column is a centered data point x_i^c.
• Let Y = (1/√m) X^T and perform SVD such that Y = U S V^T.
• The principal components are the k singular vectors with the highest singular values (rows of V^T).
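The SVD variant can be sketched the same way; again a minimal sketch with variable names of our choosing. Since Y^T Y = (1/m) X X^T = Σ, the squared singular values equal the eigenvalues of the covariance:

```python
import numpy as np

def pca_svd(X, k):
    """SVD-based PCA following the steps above.

    X: N x m, one data point per column. With Y = X_c^T / sqrt(m), the SVD
    Y = U S V^T gives the principal directions as the rows of V^T, and the
    squared singular values equal the eigenvalues of the covariance."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    m = X.shape[1]
    Y = Xc.T / np.sqrt(m)                        # m x N
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    return mu, Vt[:k].T, S[:k] ** 2              # directions and eigenvalues
```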
PCA (4)

NOTES

• Since PCA uses the eigenvectors of the covariance matrix Σ_x, it is able to find the independent axes of the data under the unimodal Gaussian assumption
  – For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes

• The main limitation of PCA is that it does not consider class separability, since it does not take into account the class label of the feature vector
  – PCA simply performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance
  – There is no guarantee that the directions of maximum variance will contain good features for discrimination
Historical Remarks

• Principal Components Analysis is the oldest technique in multivariate analysis
• PCA is also known as the Karhunen-Loève transform (communication theory)
• PCA was first introduced by Pearson in 1901, and it experienced several modifications until it was generalized by Loève in 1963
PCA example (1)
Consider a 3-dimensional Gaussian distribution with the following parameters:

  µ = (0, 5, 2)^T,   Σ = ( 25  −1   7
                           −1   4  −4
                            7  −4  10 )

[Figure: scatter plot of the data in (x1, x2, x3), and the three pairwise principal-component projections (y1, y2), (y1, y3), (y2, y3).]
The three pairs of principal component projections are shown: the first projection has the largest variance, followed by the second. Notice also that the PCA projections are de-correlated.
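The decorrelation claim can be checked numerically, assuming the symmetric covariance written above; the sample size and seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])

# Principal components: eigenvectors of the (known) covariance, largest first
vals, Phi = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]
vals, Phi = vals[order], Phi[:, order]

# Project samples onto the principal components: y = Phi^T (x - mu)
X = rng.multivariate_normal(mu, Sigma, size=20000)
Y = (X - mu) @ Phi

C = np.cov(Y.T)   # approximately diagonal, with the eigenvalues on the diagonal
```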
PCA example (2)

Compute the principal components for the following two-dimensional dataset

  x = {(x1, x2)} = {(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)}

Solution: first plot the data to get an idea of which solution we should expect.

PCA example (2) ctd

[Figure: scatter plot of the eight data points; they spread along a diagonal direction.]
PCA example (2) ctd

Solution (by hand)

• The (biased) covariance estimate of the data is:

  Σ_x = ( 6.25  4.25
          4.25  3.50 )

• The eigenvalues are the zeros of the characteristic equation

  Σ_x v = λ v  =⇒  |Σ_x − λI| = 0  =⇒  λ_1 = 9.34, λ_2 = 0.41

• The eigenvectors are the solutions of the system

  ( 6.25  4.25
    4.25  3.50 ) v_1 = λ_1 v_1  =⇒  v_1 = ( 0.81
                                            0.59 )

What is v_2?
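The by-hand computation can be verified with numpy (the eigenvector signs are fixed manually so the comparison is well defined):

```python
import numpy as np

Sigma = np.array([[6.25, 4.25],
                  [4.25, 3.50]])

vals, vecs = np.linalg.eigh(Sigma)       # eigh returns eigenvalues in ascending order
lam1, lam2 = vals[1], vals[0]
v1 = vecs[:, 1] * np.sign(vecs[0, 1])    # largest-eigenvalue direction, first entry positive
v2 = vecs[:, 0] * np.sign(vecs[1, 0])    # the remaining, orthogonal direction
```

This confirms λ_1 ≈ 9.34 and λ_2 ≈ 0.41, with v_1 ≈ (0.81, 0.59); the answer to the question above is the orthogonal direction v_2 ≈ (−0.59, 0.81).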
Eigenfaces, PCA example

Let X = {x_1, x_2, · · · , x_m} be a collection of feature vectors. Each feature vector x_i ∈ [0, 255]^{N²} corresponds to the pixel values of a visual (N × N) image of a face.

Eigenfaces, PCA example

The mean face is µ = (1/m) Σ_{i=1}^m x_i. Let x_i* = x_i − µ. The eigenfaces are the PCA basis vectors. These are found by finding the eigenvectors of

  Σ_x = (1/m) Σ_{i=1}^m x_i* (x_i*)^T = A A^T

where A is the matrix whose i-th column is x_i*/√m.

Eigenfaces
Turk & Pentland (1992): mean and eigen-faces.

Note that Σ_x may be huge (N² × N²). We need the trick of finding the eigenvectors of A^T A (of size m × m) instead. If (A^T A) v_i = λ_i v_i, then

  Σ_x (A v_i) = (A A^T) A v_i = A (A^T A) v_i = λ_i (A v_i)  =⇒  A v_i is an eigenvector of Σ_x.
Eigenfaces, PCA example
Standard Eigenfaces (← increasing eigenvalue)
Matlab example

• Effect of subtraction of the mean

[Figure: reconstructions without and with the mean subtracted.]
Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

LDA, two classes (1)

The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible

• Assume we have a set of D-dimensional samples {x_1, x_2, · · · , x_N}, N_1 of which belong to class ω_1 and N_2 to class ω_2. We seek to obtain a scalar y by projecting the samples x onto a line

  y = w^T x

• Of all the possible lines, we would like to select the one that maximizes the separability of the scalars.

This is illustrated for the two-dimensional case in the following figures.
[Figure: a two-class dataset in (x1, x2) projected onto two different lines; one projection mixes the classes, the other separates them.]
LDA, two classes (2)

In order to find a good projection vector, we need to define a measure of separation between the projections

• The mean vectors of each class in the x and y feature spaces are

  µ_i = (1/N_i) Σ_{x∈ω_i} x,   µ̃_i = (1/N_i) Σ_{y∈ω_i} y = (1/N_i) Σ_{x∈ω_i} w^T x = w^T µ_i

• We could then choose the distance between the projected means as our objective function

  J(w) = |µ̃_1 − µ̃_2| = |w^T (µ_1 − µ_2)|

• However, the distance between the projected means is not a very good measure, since it does not take into account the standard deviation within the classes.
Why such a measure may be too simplistic:

[Figure: a two-class dataset in (x1, x2); the axis with the larger distance between the projected means is not the axis that yields the better class separability.]
LDA, two classes (3)

The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class scatter.

• For each class we define the scatter, an equivalent of the variance, as

  s̃_i² = Σ_{y∈ω_i} (y − µ̃_i)²

  Then the quantity (s̃_1² + s̃_2²) is called the within-class scatter of the projected examples.

• The Fisher linear discriminant is defined as the linear projection w^T x that maximizes the criterion function

  J(w) = |µ̃_1 − µ̃_2|² / (s̃_1² + s̃_2²)

• Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
LDA, two classes (4)

• To find the optimum projection w*, we need to express J(w) as an explicit function of w.

• We define a measure of the scatter in the multivariate feature space x, the scatter matrices:

  S_W = S_1 + S_2,   where S_i = Σ_{x∈ω_i} (x − µ_i)(x − µ_i)^T

  The matrix S_W is called the within-class scatter matrix.

• The scatter of the projection y can then be expressed as a function of the scatter matrices in feature space x:

  s̃_i² = Σ_{y∈ω_i} (y − µ̃_i)² = Σ_{x∈ω_i} w^T (x − µ_i)(x − µ_i)^T w = w^T S_i w

  s̃_1² + s̃_2² = w^T S_W w

• Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space

  (µ̃_1 − µ̃_2)² = w^T (µ_1 − µ_2)(µ_1 − µ_2)^T w = w^T S_B w

  The matrix S_B = (µ_1 − µ_2)(µ_1 − µ_2)^T is called the between-class scatter. Note that, since S_B is the outer product of two vectors, its rank is at most one.

• We can finally express the Fisher criterion in terms of S_W and S_B as

  J(w) = (w^T S_B w) / (w^T S_W w)
LDA, two classes (5)

• To find the maximum of J(w) we differentiate and set to zero

  d/dw J(w) = d/dw [ (w^T S_B w) / (w^T S_W w) ] = 0
  =⇒  (w^T S_W w) d/dw [w^T S_B w] − (w^T S_B w) d/dw [w^T S_W w] = 0
  =⇒  (w^T S_W w) 2 S_B w − (w^T S_B w) 2 S_W w = 0
  =⇒  (w^T S_W w) S_B w = (w^T S_B w) S_W w

• From the definition of S_B we see that S_B w ∝ (µ_1 − µ_2)

• We do not care about the magnitude of w, only its direction.

• Then, dropping any scale factors and multiplying both sides by S_W^{-1}:

  w ∝ S_W^{-1} (µ_1 − µ_2)

• In summary, we get the Fisher Linear Discriminant (1936)

  w* = arg max_w (w^T S_B w) / (w^T S_W w) = S_W^{-1} (µ_1 − µ_2)

Although it should be noted it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.
LDA Classifier

Use w to construct a classifier. Project the training data onto the line w. We must find a threshold θ such that

  w^T x ≥ θ  =⇒  x belongs to class ω_1
  w^T x < θ  =⇒  x belongs to class ω_2

How should we learn θ?

• Project the training data from both classes onto w.
• Search for the θ that minimizes the number of misclassifications.
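A brute-force sketch of the threshold search just described, assuming class ω_1 projects to the larger values; scanning midpoints between adjacent projected values is one reasonable candidate set, not prescribed by the notes:

```python
import numpy as np

def lda_threshold(w, X1, X2):
    """Project both classes onto w (rows of X1, X2 are samples), then pick the
    candidate theta (midpoints between adjacent projections) with the fewest
    training misclassifications. Returns (theta, error_count)."""
    y1, y2 = X1 @ w, X2 @ w
    ys = np.sort(np.concatenate([y1, y2]))
    candidates = (ys[:-1] + ys[1:]) / 2
    best_theta, best_err = candidates[0], len(y1) + len(y2)
    for theta in candidates:
        err = np.sum(y1 < theta) + np.sum(y2 >= theta)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err
```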
LDA, example
Compute the Linear Discriminant projection for the following two-dimensional dataset
X1 = {x1, · · · , x5} = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)}
X2 = {x1, · · · , x5} = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}
LDA, example

Compute the Linear Discriminant projection for the following two-dimensional dataset

  X1 = {x_1, · · · , x_5} = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)}
  X2 = {x_1, · · · , x_5} = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}

Solution (by hand):

• The class statistics (using the averaged scatter) are:

  S_1 = ( 0.80  −0.40      S_2 = ( 1.84  −0.04
         −0.40   2.64 )           −0.04   2.64 )

  µ_1 = (3.0, 3.6)^T,   µ_2 = (8.4, 7.6)^T

• The within- and between-class scatter are

  S_B = ( 29.16  21.60      S_W = ( 2.64  −0.44
          21.60  16.00 )           −0.44   5.28 )

• The LDA projection is then obtained as the solution of the generalized eigenvalue problem

  S_W^{-1} S_B v = λ v  =⇒  |S_W^{-1} S_B − λI| = 0
  =⇒  | 11.89 − λ     8.81
           5.08   3.76 − λ | = 0  =⇒  λ = 15.65

  Then plugging in to find the eigenvector, which is the optimal projection direction:

  ( 11.89  8.81
     5.08  3.76 ) v = 15.65 v  =⇒  v = ( 0.91
                                         0.39 )

• Or directly by

  w* = S_W^{-1} (µ_1 − µ_2) = (−0.91, −0.39)^T
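The worked example can be verified numerically; the averaged ("biased") scatter below matches the convention used in the by-hand solution:

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, mu):
    Xc = X - mu
    return (Xc.T @ Xc) / len(X)        # averaged scatter, as in the example

S_W = scatter(X1, mu1) + scatter(X2, mu2)
w = np.linalg.solve(S_W, mu1 - mu2)    # direction S_W^{-1} (mu1 - mu2)
w_unit = w / np.linalg.norm(w)         # normalized Fisher direction
```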
Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes
• Linear Discriminant Analysis, C classes

LDA, C classes

Fisher’s LDA generalizes very gracefully for C-class problems. Instead of one projection y, we now seek (C − 1) projections (y_1, y_2, · · · , y_{C−1}) by means of (C − 1) projection vectors w_i, which can be arranged by columns into a projection matrix W = [w_1 | w_2 | · · · | w_{C−1}].
Derivation

• The generalization of the within-class scatter is

  S_W = Σ_{i=1}^C S_i,   where S_i = Σ_{x∈ω_i} (x − µ_i)(x − µ_i)^T   and   µ_i = (1/N_i) Σ_{x∈ω_i} x

• The generalization of the between-class scatter is

  S_B = Σ_{i=1}^C N_i (µ_i − µ)(µ_i − µ)^T,   where µ = (1/N) Σ_∀x x = (1/N) Σ_{i=1}^C N_i µ_i
  The matrix S_T = S_B + S_W is called the total scatter matrix.

[Figure: a three-class dataset in (x1, x2) with the per-class within-class scatters S_W1, S_W2, S_W3 and between-class scatters S_B1, S_B2, S_B3 indicated.]
• Similarly, we define the mean vectors and scatter matrices for the projected samples as

  µ̃_i = (1/N_i) Σ_{y∈ω_i} y,        S̃_W = Σ_{i=1}^C Σ_{y∈ω_i} (y − µ̃_i)(y − µ̃_i)^T
  µ̃ = (1/N) Σ_∀y y,                 S̃_B = Σ_{i=1}^C N_i (µ̃_i − µ̃)(µ̃_i − µ̃)^T

• From our derivation for the two-class problem, we can write

  S̃_W = W^T S_W W,   S̃_B = W^T S_B W

• Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C − 1 dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function:

  J(W) = |S̃_B| / |S̃_W| = |W^T S_B W| / |W^T S_W W|

• We seek the projection matrix W* that maximizes this ratio.
LDA, C classes (2)

It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:

  W* = [w*_1 w*_2 · · · w*_{C−1}] = arg max_W ( |W^T S_B W| / |W^T S_W W| )  =⇒  (S_B − λ_i S_W) w*_i = 0

• S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by

  µ = (1/N) Σ_{i=1}^C N_i µ_i

  so only (C − 1) of the differences (µ_i − µ) are independent. Therefore, S_B will be of rank ≤ (C − 1), which implies that only (C − 1) of the eigenvalues λ_i will be non-zero.

• The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of S_W^{-1} S_B.

• LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices.
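A sketch of the C-class procedure, solving the S_W^{-1} S_B eigenproblem directly as in the text; the function name and the test data are illustrative:

```python
import numpy as np

def lda_projections(X, labels, k):
    """Multi-class LDA sketch: build S_W and S_B as defined above, then keep
    the top-k eigenvectors of S_W^{-1} S_B. X: n x d, one sample per row."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        D = Xc - mu_c
        S_W += D.T @ D                           # within-class scatter S_i
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)         # N_i (mu_i - mu)(mu_i - mu)^T
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]          # at most C-1 non-zero eigenvalues
    return vecs.real[:, order[:k]], vals.real[order]
```

With C = 3 classes, the rank argument above predicts at most two non-zero eigenvalues even in a higher-dimensional feature space.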
Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes
• Linear Discriminant Analysis, C classes
• LDA vs. PCA example

Coffee discrimination with a gas sensor array

• Five types of coffee beans were presented to an array of chemical gas sensors
• For each coffee type, 45 “sniffs” were performed, and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
[Figure: gas-sensor response traces for the five coffees (Sulawesy, Kenya, Arabian, Sumatra, Colombia), and 3D scatter plots of the first three PCA projections and the first three LDA projections.]
These figures show the performance of PCA and LDA on this odor recognition problem.
Results:
• From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class
discrimination
• This is one example where the discriminatory information is not aligned with the
direction of maximum variance
Linear Discriminant Analysis
• Linear Discriminant Analysis, two classes
• Linear Discriminant Analysis, C classes
• LDA Vs. PCA example
• Limitations of LDA
Limitations of LDA
• LDA produces at most C − 1 feature projections
– If the classification error estimates establish that more features are needed, some
other method must be employed to provide those additional features
• LDA is a parametric method since it assumes unimodal Gaussian likelihoods
– If the distributions are significantly non-Gaussian, the LDA projections will not
be able to preserve any complex structure of the data, which may be needed for
classification.
[Figure: two 2-D examples (axes x1, x2) comparing the PCA and LDA projection
directions: one with unimodal Gaussian classes (means µ1 ≠ µ2, common covariance
Σ1 = Σ2 = Σ), where LDA succeeds, and one where the classes differ mainly in
covariance (Σ1 ≠ Σ2), where LDA fails]
• LDA will fail when the discriminatory information is not in the mean but rather in
the variance of the data.
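A quick synthetic illustration of this failure mode (all numbers hypothetical): two classes share a mean but differ in spread, so the between-class scatter is essentially zero and LDA has nothing to maximize.

```python
import numpy as np

rng = np.random.default_rng(1)
# Same mean, different covariance: the discriminative info is in the variance
X1 = rng.standard_normal((500, 2)) * [1.0, 0.3]   # class 1: narrow in x2
X2 = rng.standard_normal((500, 2)) * [1.0, 3.0]   # class 2: wide in x2
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter depends only on the mean difference
S_B = np.outer(mu1 - mu2, mu1 - mu2)
print(np.linalg.norm(S_B))   # close to 0: no separation for LDA to exploit
```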
Linear Discriminant Analysis
• Linear Discriminant Analysis, two classes
• Linear Discriminant Analysis, C classes
• LDA Vs. PCA example
• Limitations of LDA
• Variants of LDA
Variants of LDA
• Non-parametric LDA (Fukunaga)
– NPLDA removes the unimodal Gaussian assumption by computing the between-
class scatter matrix SB using local information and the K Nearest Neighbors
rule.
∗ The matrix SB is full-rank, allowing the extraction of more than (C − 1)
features
∗ The projections are able to preserve the structure of the data more closely
• Orthonormal LDA (Okada and Tomita)
– OLDA computes projections that maximize the Fisher criterion and, at the same
time, are pair-wise orthonormal
∗ The method used in OLDA combines the eigenvalue solution of S_W^{-1} S_B and
the Gram-Schmidt orthonormalization procedure.
∗ OLDA sequentially finds axes that maximize the Fisher criterion in the subspace
orthogonal to all features already extracted
∗ OLDA is also capable of finding more than (C − 1) features
• Generalized LDA (Lowe)
– GLDA generalizes the Fisher criterion by incorporating a cost function similar to
the one used to compute the Bayes Risk
∗ The effect of this generalized criterion is an LDA projection with a structure
that is biased by the cost function
∗ Classes with a higher cost Cij will be placed further apart in the low-dimensional
projection
• Multilayer Perceptrons (Webb and Lowe)
– It has been shown that the hidden layers of multi-layer perceptrons (MLP) perform
non-linear discriminant analysis by maximizing tr[S_B S_T^†], where S_T^† denotes
the pseudo-inverse of the total scatter matrix and the scatter matrices are
measured at the output of the last hidden layer.
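As an illustration of the orthonormal variant, the sketch below (my own simplification, not Okada and Tomita's exact algorithm) finds each new axis by solving the Fisher eigenproblem restricted to the subspace orthogonal to the axes already extracted:

```python
import numpy as np

def orthonormal_lda(S_B, S_W, n_axes):
    """Sequentially maximize the Fisher criterion w^T S_B w / w^T S_W w,
    each new axis constrained to be orthonormal to the previous ones."""
    d = S_B.shape[0]
    axes = []
    Q = np.eye(d)                 # orthonormal basis of the remaining subspace
    for k in range(n_axes):
        # Restrict both scatter matrices to the current subspace and solve
        vals, vecs = np.linalg.eig(
            np.linalg.inv(Q.T @ S_W @ Q) @ (Q.T @ S_B @ Q))
        w = Q @ np.real(vecs[:, np.argmax(np.real(vals))])
        axes.append(w / np.linalg.norm(w))
        # New basis spans the orthogonal complement of the axes found so far
        U, _, _ = np.linalg.svd(np.column_stack(axes), full_matrices=True)
        Q = U[:, k + 1:]
    return np.column_stack(axes)
```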
Linear Discriminant Analysis
• Linear Discriminant Analysis, two classes
• Linear Discriminant Analysis, C classes
• LDA Vs. PCA example
• Limitations of LDA
• Variants of LDA
• Other dimensionality reduction methods
Other dimensionality reduction methods
Exploratory Projection Pursuit (Friedman and Tukey)
• EPP seeks an M-dimensional (typically M = 2 or 3) linear projection of the data
that maximizes a measure of “interestingness”.
• Interestingness is measured as departure from multivariate normality.
– This measure is not the variance and is commonly scale-free. In most proposals
it is also affine invariant, so it does not depend on correlations between features.
[Ripley, 1996]
• In other words, EPP seeks projections that separate clusters as much as possible
while keeping each cluster compact, a criterion similar to Fisher's, although EPP
does NOT use class labels.
• Once an interesting projection is found, it is important to remove the structure it
reveals to allow other interesting views to be found more easily.
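A crude sketch of the idea (the random search and the excess-kurtosis index are my simplifications; real projection-pursuit indices and optimizers are more sophisticated):

```python
import numpy as np

def pursue_1d(X, n_trials=2000, seed=0):
    """Search random unit directions for the 1-D projection that departs
    most from normality, scored by |excess kurtosis| (a scale-free index)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)            # center once; projections have zero mean
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        z = Xc @ w
        z = z / z.std()                # standardize the projection
        score = abs(np.mean(z ** 4) - 3.0)   # 0 for a Gaussian projection
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

On data with a bimodal direction hidden among Gaussian noise dimensions, the search tends to return a direction aligned with the bimodal one.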
[Figure: two 2-D scatter plots (axes x1, x2): an “uninteresting” projection,
roughly Gaussian, and an “interesting” projection showing cluster structure]

Other dimensionality reduction methods
Sammon’s Non-linear Mapping (Sammon)
• This method seeks a mapping onto an M-dimensional space that preserves the
inter-point distances of the original N-dimensional space.
– This is accomplished by minimizing the following objective function
Cost(d, d') = Σ_{i≠j} [d(P_i, P_j) − d(P'_i, P'_j)]² / d(P_i, P_j)
∗ The original method did not obtain an explicit mapping but only a lookup table
for the elements in the training set
∗ Recent implementations using artificial neural networks (MLPs and RBFs) do
provide an explicit mapping for test data and also consider cost functions
(Neuroscale)
∗ Sammon’s mapping is closely related to Multi-Dimensional Scaling (MDS), a
family of multivariate statistical methods commonly used in the social sciences
[Figure: points P_i, P_j in the original space (x1, x2, x3) mapped to P'_i, P'_j
in the projected space (x1, x2), with d(P_i, P_j) = d(P'_i, P'_j) ∀ i, j]
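As a sketch of how such a mapping can be computed (the normalization constant and the use of a generic optimizer are my choices; Sammon's original algorithm uses a tailored second-order gradient rule):

```python
import numpy as np
from scipy.optimize import minimize

def sammon_stress(y_flat, D, n, m=2):
    """Normalized Sammon stress: sum of (d_ij - d'_ij)^2 / d_ij over i < j."""
    Y = y_flat.reshape(n, m)
    Dp = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
    i, j = np.triu_indices(n, 1)
    d, dp = D[i, j], Dp[i, j]
    return np.sum((d - dp) ** 2 / d) / np.sum(d)

def sammon(X, m=2, seed=0):
    """Map X (n x N) to m dimensions by minimizing the stress numerically."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # original distances
    y0 = np.random.default_rng(seed).standard_normal(n * m)
    res = minimize(sammon_stress, y0, args=(D, n, m))
    return res.x.reshape(n, m), res.fun
```

For points that already lie on a 2-D plane embedded in a higher-dimensional space, a near-zero-stress embedding exists, so the optimizer should recover the configuration up to rotation and translation.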