Page 1:

Why reduce the number of features?

• Having D features, we want to reduce their number to n, where n ≪ D

• Benefits: Lower computational complexity

Improved classification performance

• Danger: Possible loss of information

Page 2:

Basic approaches to DR

• Feature extraction

Transform t : R^D → R^n

Creation of a new feature space. The features lose their original meaning.

• Feature selection

Selection of a subset of the original features.

Page 3:

Principal Component Transform (Karhunen-Loève)

• PCT belongs to "feature extraction"; the transform t is a rotation

y = T x,

where the rows of T are the eigenvectors of the original covariance matrix Cx

• PCT creates D new uncorrelated features y,

Cy = T Cx T'

• The n features with the highest variance are kept
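As a minimal sketch of the transform just described, assuming a data matrix X of N samples by D features: the covariance matrix is eigendecomposed, the eigenvectors become the rows of T, and the n highest-variance components are kept. Function and variable names (pct, X, n) are illustrative, not from the slides.

```python
import numpy as np

def pct(X, n):
    """Principal Component Transform of samples X (N x D):
    rotate into uncorrelated features, keep the n with the highest variance."""
    Xc = X - X.mean(axis=0)                # center the data
    Cx = np.cov(Xc, rowvar=False)          # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Cx)  # eigh: Cx is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    T = eigvecs[:, order].T                # rows of T are the eigenvectors
    Y = Xc @ T.T                           # y = T x for every sample
    return Y[:, :n], T

# Example: 500 samples with 5 correlated features, reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Y, T = pct(X, 2)
print(np.round(np.cov(Y, rowvar=False), 3))  # nearly diagonal: the new features are uncorrelated
```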

Page 4:

Principal Component Transform

Page 5:

Applications of the PCT

• “Optimal” data representation, compaction of the energy

• Visualization and compression of multimodal images
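A small sketch of the energy-compaction idea, assuming the same covariance eigendecomposition as above: it reports the fraction of total variance ("energy") carried by the n largest principal components. The function name energy_retained is illustrative.

```python
import numpy as np

def energy_retained(X, n):
    """Fraction of total variance (energy) kept by the n largest principal components."""
    Cx = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(Cx))[::-1]  # component variances, descending
    return eigvals[:n].sum() / eigvals.sum()

# E.g. for a 6-band multispectral image flattened to (pixels x 6),
# a few components typically retain most of the energy, hence the use for compression.
```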

Page 6:

PCT of multispectral images

Satellite image: B, G, R, nIR, IR, thermal IR

Page 7:

Why is PCT bad for classification purposes?

PCT evaluates the contribution of individual features solely by their variance, which may differ from their discrimination power.
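A toy illustration of this point with synthetic data (the numbers are invented for illustration): the two classes differ only along a low-variance feature, so the leading principal component aligns with the high-variance feature that carries no class information.

```python
import numpy as np

rng = np.random.default_rng(1)
# Large shared variance along feature 0, small class separation along feature 1
class1 = rng.normal([0.0, -0.5], [10.0, 0.3], size=(500, 2))
class2 = rng.normal([0.0, +0.5], [10.0, 0.3], size=(500, 2))
X = np.vstack([class1, class2])

Cx = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Cx)
pc1 = eigvecs[:, np.argmax(eigvals)]  # direction of maximal variance
print(np.round(pc1, 3))               # ~ ±[1, 0]: PC1 ignores the discriminative feature 1
```

Keeping only the first principal component here would discard almost all of the class separation.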

Page 8:

Why is PCT bad for classification purposes?

Page 9:

Separability problem

• Dimensionality reduction methods for classification purposes (Two-class problem) must consider the discrimination power of individual features.

• The goal is to maximize the “distance” between the classes.

Page 10:

An example: 3 classes in a 3D feature space, reduction to 2D

[Figure: two 2D projections, one with high discriminability and one with low discriminability]

Page 11:

DR via feature selection

Two things are needed:

• A discriminability measure

(e.g. the Mahalanobis distance or the Bhattacharyya distance)

MD12 = (m1 − m2) (C1 + C2)^(-1) (m1 − m2)'

• A selection strategy

Feature selection is an optimization problem.
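A short sketch of the MD12 criterion above, with the class means and covariances estimated from samples; the names mahalanobis_12, X1 and X2 are illustrative.

```python
import numpy as np

def mahalanobis_12(X1, X2):
    """Two-class discriminability: MD12 = (m1 - m2) (C1 + C2)^(-1) (m1 - m2)'."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C1 = np.atleast_2d(np.cov(X1, rowvar=False))  # atleast_2d: also works for a single feature
    C2 = np.atleast_2d(np.cov(X2, rowvar=False))
    d = m1 - m2
    return float(d @ np.linalg.inv(C1 + C2) @ d)

# Score a single feature j:    mahalanobis_12(X1[:, [j]], X2[:, [j]])
# Score a candidate subset S:  mahalanobis_12(X1[:, S], X2[:, S])
```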

Page 12:

Feature selection strategies

• Optimal
  - full search, complexity D! / ((D−n)! n!)
  - branch & bound

• Sub-optimal
  - direct selection (optimal if the features are not correlated)
  - sequential selection (SFS, SBS; see the sketch below)
  - generalized sequential selection (SFS(k), Plus k minus m, floating search)
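A minimal sketch of sequential forward selection (SFS) from the list above, assuming any two-class criterion J(S), for instance the mahalanobis_12 function from the previous sketch; all names here are illustrative.

```python
def sfs(X1, X2, n, criterion):
    """Sequential Forward Selection: greedily add the feature whose inclusion
    maximizes the two-class criterion until n features are selected."""
    D = X1.shape[1]
    selected = []
    while len(selected) < n:
        remaining = [j for j in range(D) if j not in selected]
        best = max(remaining,
                   key=lambda j: criterion(X1[:, selected + [j]],
                                           X2[:, selected + [j]]))
        selected.append(best)
    return selected

# e.g. sfs(X1, X2, n=2, criterion=mahalanobis_12)
```

SBS works in the opposite direction, starting from all D features and removing the least useful one at each step; floating search additionally allows conditional backtracking.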


Page 14:

A priori knowledge in feature selection

• The above discriminability measures (MD, BD)

require normally distributed classes.

They are misleading and inapplicable otherwise.

• Crucial questions in practical applications:

Can the class-conditional distributions be assumed to be normal?

What happens if this assumption is wrong?
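One practical sanity check, sketched here under the assumption that per-feature (univariate) normality is an acceptable proxy: a D'Agostino-Pearson test (scipy.stats.normaltest) applied to each feature of one class's samples. The function name check_normality is illustrative.

```python
import numpy as np
from scipy import stats

def check_normality(X, alpha=0.01):
    """Flag features of one class's samples X (N x D) whose univariate
    normality hypothesis is rejected at significance level alpha."""
    pvals = np.array([stats.normaltest(X[:, j]).pvalue for j in range(X.shape[1])])
    return np.where(pvals < alpha)[0]

# Features flagged here violate the normality assumption behind MD/BD,
# so their discriminability scores should be treated with caution.
```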

Page 15:

A two-class example

[Figure: Class 1 and Class 2 distributions; in one case feature x2 is selected, in the other feature x1 is selected]

Page 16:

Conclusion

• PCT is optimal for representation of "one-class" data (visualization, compression, etc.).

• PCT should not be used for classification purposes. Use feature selection methods based on a proper discriminability measure.

• If you still use PCT before classification, be aware of possible errors.