Why reduce the number of features?
• Given D features, we want to reduce their number to n, where n << D
• Benefits: lower computational complexity,
improved classification performance
• Danger: possible loss of information
Basic approaches to DR
• Feature extraction
Transform t : R^D → R^n
A new feature space is created; the features lose their original meaning.
• Feature selection
Selection of a subset of the original features.
Principal Component Transform (Karhunen-Loève)
• PCT is a "feature extraction" method; t is a rotation
y = T'x,
where T is the matrix whose columns are the eigenvectors of the original covariance matrix Cx
• PCT creates D new, mutually uncorrelated features y:
Cy = T' Cx T (diagonal)
• The n features with the highest variance are kept
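The transform above can be sketched in a few lines of NumPy (an illustrative implementation on made-up data, not the lecture's code):

```python
import numpy as np

def pct(X, n):
    """Principal Component Transform: rotate the D-dimensional samples
    (rows of X) into the eigenbasis of their covariance matrix and keep
    the n components with the highest variance."""
    Xc = X - X.mean(axis=0)            # center the data
    Cx = np.cov(Xc, rowvar=False)      # D x D covariance matrix
    eigvals, T = np.linalg.eigh(Cx)    # eigenvectors are the columns of T
    order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
    T = T[:, order[:n]]                # keep the top-n eigenvectors
    return Xc @ T                      # y = T'x for every sample

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # synthetic 5-feature data
Y = pct(X, 2)
print(Y.shape)                         # (500, 2)
```

The n output features are uncorrelated by construction: the covariance of Y is T' Cx T restricted to the kept components, which is diagonal.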
Applications of the PCT
• “Optimal” data representation, compaction of the energy
• Visualization and compression of multimodal images
PCT of multispectral images
Satellite image: B, G, R, nIR, IR, thermal IR
Why is PCT bad for classification purposes?
PCT evaluates the contribution of individual features solely by their variation, which may be different from their discrimination power.
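A small synthetic illustration of this point (the class layout is an assumption, chosen to make the effect obvious): two classes share a high-variance direction x1, while only the low-variance direction x2 separates them. The first principal component aligns with x1 and discards the discriminative feature.

```python
import numpy as np

rng = np.random.default_rng(1)
# Both classes spread widely along x1 (std 10), but only x2
# (std 0.5, class means -2 vs +2) actually separates them.
class1 = rng.normal([0, -2], [10.0, 0.5], size=(200, 2))
class2 = rng.normal([0,  2], [10.0, 0.5], size=(200, 2))
X = np.vstack([class1, class2])

Cx = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, T = np.linalg.eigh(Cx)
pc1 = T[:, np.argmax(eigvals)]   # direction of the highest variance
# pc1 is (nearly) aligned with x1 -- the feature with no class information
```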
Separability problem
• Dimensionality reduction methods for classification (here, the two-class problem) must consider the discrimination power of the individual features.
• The goal is to maximize the “distance” between the classes.
An example: 3 classes, 3D feature space, reduction to 2D
(Figure: one 2D projection with high discriminability, one with low discriminability)
DR via feature selection
Two things needed:
• Discriminability measure
(Mahalanobis distance, Bhattacharyya distance)
MD12 = (m1 − m2)(C1 + C2)^(-1) (m1 − m2)'
• Selection strategy
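The Mahalanobis distance above is straightforward to compute from the data; a minimal NumPy sketch, using sample means and covariances as plug-in estimates:

```python
import numpy as np

def mahalanobis2(X1, X2):
    """Mahalanobis distance between two classes given as sample
    matrices (rows = samples):
    MD12 = (m1 - m2)(C1 + C2)^(-1)(m1 - m2)'."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C1 = np.cov(X1, rowvar=False)
    C2 = np.cov(X2, rowvar=False)
    d = m1 - m2
    return d @ np.linalg.inv(C1 + C2) @ d
```

A larger value means better-separated classes, so the feature subset maximizing this quantity is preferred.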
Feature selection optimization problem
Feature selection strategies
• Optimal
- full search, complexity D! / ((D − n)! n!)
- branch & bound
• Sub-optimal
- direct selection (optimal if the features are not correlated)
- sequential selection (SFS, SBS)
- generalized sequential selection (SFS(k), Plus k minus m, floating search)
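Sequential forward selection (SFS) from the list above can be sketched as a greedy loop (an illustrative version; the criterion here is a Mahalanobis-type class distance, and the function names are mine):

```python
import numpy as np

def md(X1, X2):
    # Mahalanobis-type class distance used as the selection criterion
    d = X1.mean(axis=0) - X2.mean(axis=0)
    C = (np.atleast_2d(np.cov(X1, rowvar=False)) +
         np.atleast_2d(np.cov(X2, rowvar=False)))
    return d @ np.linalg.inv(C) @ d

def sfs(X1, X2, n, score):
    """Sequential Forward Selection: grow the feature subset greedily,
    at each step adding the feature that maximizes score() of the
    resulting subset."""
    D = X1.shape[1]
    selected = []
    while len(selected) < n:
        best_f, best_s = None, -np.inf
        for f in range(D):
            if f in selected:
                continue
            cols = selected + [f]
            s = score(X1[:, cols], X2[:, cols])
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
    return selected
```

SFS evaluates only O(Dn) subsets instead of the full D!/((D − n)! n!), at the price of possibly missing the optimal subset when features interact.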
A priori knowledge in feature selection
• The above discriminability measures (MD, BD)
require normally distributed classes.
They are misleading and inapplicable otherwise.
• Crucial questions in practical applications:
Can the class-conditional distributions be assumed to be normal?
What happens if this assumption is wrong?
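A one-dimensional sketch of how MD misleads for non-Gaussian classes (synthetic data with an assumed bimodal layout): a bimodal class and a tight Gaussian class have nearly equal means, so MD is close to zero even though the classes barely overlap.

```python
import numpy as np

rng = np.random.default_rng(3)
# Class 1 is bimodal (non-Gaussian): two tight modes at -3 and +3,
# so its overall mean is near 0.
x1 = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
# Class 2 is a tight Gaussian sitting between the modes.
x2 = rng.normal(0, 0.5, 1000)

d = x1.mean() - x2.mean()
md = d**2 / (x1.var(ddof=1) + x2.var(ddof=1))
# md is close to 0: MD declares the classes inseparable,
# yet almost no samples of class 1 fall near class 2
```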
A two-class example
(Figure: two configurations of Class 1 and Class 2; x2 is selected in one case, x1 in the other)
Conclusion
• PCT is optimal for representing "one-class" data (visualization, compression, etc.).
• PCT should not be used for classification. Use feature selection methods based on a proper discriminability measure.
• If you still apply PCT before classification, be aware of the possible errors.