Why reduce the number of features?
• Given D features, we want to reduce their number to n, where n << D
• Benefits: lower computational complexity,
improved classification performance
• Danger: possible loss of information
Basic approaches to DR
• Feature extraction
Transform t : R^D → R^n
A new feature space is created; the features lose their original meaning.
• Feature selection
Selection of a subset of the original features.
Principal Component Transform (Karhunen-Loève)
• PCT is a "feature extraction" method; t is a rotation
y = T'x,
where T is the matrix whose columns are the eigenvectors of the original covariance matrix Cx
• PCT creates D new, mutually uncorrelated features y:
Cy = T' Cx T (diagonal)
• The n features with the highest variance are kept
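The transform above can be sketched in a few lines of NumPy (an illustrative implementation on made-up data, not the lecture's code):

```python
import numpy as np

def pct(X, n):
    """Principal Component Transform: rotate the D-dimensional samples
    (rows of X) into the eigenbasis of their covariance matrix and keep
    the n components with the highest variance."""
    Xc = X - X.mean(axis=0)            # center the data
    Cx = np.cov(Xc, rowvar=False)      # D x D covariance matrix
    eigvals, T = np.linalg.eigh(Cx)    # eigenvectors are the columns of T
    order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
    T = T[:, order[:n]]                # keep the top-n eigenvectors
    return Xc @ T                      # y = T'x for every sample

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # synthetic 5-feature data
Y = pct(X, 2)
print(Y.shape)                         # (500, 2)
```

The n output features are uncorrelated by construction: the covariance of Y is T' Cx T restricted to the kept components, which is diagonal.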
Applications of the PCT
• “Optimal” data representation, compaction of the energy
• Visualization and compression of multimodal images
PCT of multispectral images
Satellite image: B, G, R, nIR, IR, thermal IR
Why is PCT bad for classification purposes?
PCT evaluates the contribution of individual features solely by their variation, which may be different from their discrimination power.
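A small synthetic illustration of this point (the class layout is an assumption, chosen to make the effect obvious): two classes share a high-variance direction x1, while only the low-variance direction x2 separates them. The first principal component aligns with x1 and discards the discriminative feature.

```python
import numpy as np

rng = np.random.default_rng(1)
# Both classes spread widely along x1 (std 10), but only x2
# (std 0.5, class means -2 vs +2) actually separates them.
class1 = rng.normal([0, -2], [10.0, 0.5], size=(200, 2))
class2 = rng.normal([0,  2], [10.0, 0.5], size=(200, 2))
X = np.vstack([class1, class2])

Cx = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, T = np.linalg.eigh(Cx)
pc1 = T[:, np.argmax(eigvals)]   # direction of the highest variance
# pc1 is (nearly) aligned with x1 -- the feature with no class information
```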
Separability problem
• Dimensionality reduction methods for classification (here, the two-class problem) must consider the discrimination power of the individual features.
• The goal is to maximize the “distance” between the classes.
An example: 3 classes, 3D feature space, reduction to 2D
(Figure: one 2D projection with high discriminability, one with low discriminability)
DR via feature selection
Two things needed:
• Discriminability measure
(Mahalanobis distance, Bhattacharyya distance)
MD12 = (m1 − m2)(C1 + C2)^(-1) (m1 − m2)'
• Selection strategy
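The Mahalanobis distance above is straightforward to compute from the data; a minimal NumPy sketch, using sample means and covariances as plug-in estimates:

```python
import numpy as np

def mahalanobis2(X1, X2):
    """Mahalanobis distance between two classes given as sample
    matrices (rows = samples):
    MD12 = (m1 - m2)(C1 + C2)^(-1)(m1 - m2)'."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C1 = np.cov(X1, rowvar=False)
    C2 = np.cov(X2, rowvar=False)
    d = m1 - m2
    return d @ np.linalg.inv(C1 + C2) @ d
```

A larger value means better-separated classes, so the feature subset maximizing this quantity is preferred.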
Feature selection optimization problem
Feature selection strategies
• Optimal
- full search, complexity D! / ((D − n)! n!)
- branch & bound
• Sub-optimal
- direct selection (optimal if the features are not correlated)
- sequential selection (SFS, SBS)
- generalized sequential selection (SFS(k), Plus k minus m, floating search)
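Sequential forward selection (SFS) from the list above can be sketched as a greedy loop (an illustrative version; the criterion here is a Mahalanobis-type class distance, and the function names are mine):

```python
import numpy as np

def md(X1, X2):
    # Mahalanobis-type class distance used as the selection criterion
    d = X1.mean(axis=0) - X2.mean(axis=0)
    C = (np.atleast_2d(np.cov(X1, rowvar=False)) +
         np.atleast_2d(np.cov(X2, rowvar=False)))
    return d @ np.linalg.inv(C) @ d

def sfs(X1, X2, n, score):
    """Sequential Forward Selection: grow the feature subset greedily,
    at each step adding the feature that maximizes score() of the
    resulting subset."""
    D = X1.shape[1]
    selected = []
    while len(selected) < n:
        best_f, best_s = None, -np.inf
        for f in range(D):
            if f in selected:
                continue
            cols = selected + [f]
            s = score(X1[:, cols], X2[:, cols])
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
    return selected
```

SFS evaluates only O(Dn) subsets instead of the full D!/((D − n)! n!), at the price of possibly missing the optimal subset when features interact.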
A priori knowledge in feature selection
• The above discriminability measures (MD, BD)
require normally distributed classes.
They are misleading and inapplicable otherwise.
• Crucial questions in practical applications:
Can the class-conditional distributions be assumed to be normal?
What happens if this assumption is wrong?
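A one-dimensional sketch of how MD misleads for non-Gaussian classes (synthetic data with an assumed bimodal layout): a bimodal class and a tight Gaussian class have nearly equal means, so MD is close to zero even though the classes barely overlap.

```python
import numpy as np

rng = np.random.default_rng(3)
# Class 1 is bimodal (non-Gaussian): two tight modes at -3 and +3,
# so its overall mean is near 0.
x1 = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
# Class 2 is a tight Gaussian sitting between the modes.
x2 = rng.normal(0, 0.5, 1000)

d = x1.mean() - x2.mean()
md = d**2 / (x1.var(ddof=1) + x2.var(ddof=1))
# md is close to 0: MD declares the classes inseparable,
# yet almost no samples of class 1 fall near class 2
```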
A two-class example
(Figure: two configurations of Class 1 and Class 2; x2 is selected in one case, x1 in the other)
Conclusion
• PCT is optimal for representing "one-class" data (visualization, compression, etc.).
• PCT should not be used for classification. Use feature selection methods based on a proper discriminability measure.
• If you still apply PCT before classification, be aware of the possible errors.