Feature Selection for Image Retrieval and Object Recognition
Nuno Vasconcelos et al.
Statistical Visual Computing Lab, ECE, UCSD
Presented by Dashan Gao
Scalable Discriminant Feature Selection for Image Retrieval and Recognition, N. Vasconcelos and M. Vasconcelos, to appear in IEEE CVPR, 2004.
Feature Selection by Maximum Marginal Diversity: Optimality and Implications for Visual Recognition, N. Vasconcelos, Proceedings of IEEE CVPR, 2003.
Feature Selection by Maximum Marginal Diversity, N. Vasconcelos, Proceedings of Neural Information Processing Systems (NIPS), 2002.
Overview (1)
Image retrieval is a large-scale classification problem:
- a large number of classes, and large amounts of data per class
- a discriminant feature space of small dimensionality is a prerequisite for success
Feature selection (FS) makes learning easier and tractable by working in a lower-dimensional feature space X.
Goal: find a transformation T, constrained to be a subset projection, i.e. find the projection matrix T that optimizes a criterion of "feature goodness" (formalized below).
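A minimal formalization of the subset-projection constraint (the notation here is assumed, not quoted from the paper):
\[ \mathbf{z} = T\mathbf{x}, \qquad \mathbf{x} \in \mathbb{R}^n,\ \mathbf{z} \in \mathbb{R}^d,\ d < n, \]
where the rows of the d-by-n matrix T are d distinct rows of the identity matrix, so that z simply keeps a subset of the coordinates of x; FS then solves \( T^* = \arg\max_T J(T) \) for a feature-goodness criterion J.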
Overview (2)
Weaknesses of traditional methods:
- based on sub-optimal criteria, e.g. variance maximization (principal component analysis, PCA)
- lack of scalability: they take infeasible time to compute
- difficult to extend to multi-class problems (e.g. boosting)
Ultimate goal: minimum probability of error (MPE); search for the Bayes-error-optimal space of a given classification problem.
Achievable goal (discriminant sense): maximize the separation between the classes to be recognized.
Information-theoretic feature selection (ITFS)
Infomax goal: maximize the mutual information between the selected features and the class labels.
Outline:
- optimality properties, in the MPE and discriminant senses (Contribution 1)
- trade-off between optimality and complexity (Contribution 2)
- an algorithmic implementation with low complexity
Bayes Error (BE)
Advantage: BE depends only on the feature space, and is thus the ultimate discriminant measure for FS.
Disadvantage: the nonlinearity of the max(.) operation makes it hard to optimize.
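For reference, the Bayes error on a feature space (standard textbook definition, stated in assumed notation):
\[ L^* \;=\; E_{\mathbf{x}}\Big[\,1 - \max_i P_{Y|\mathbf{X}}(i \mid \mathbf{x})\Big], \]
with class label Y ∈ {1, …, M}; the max(.) inside the expectation is the nonlinearity referred to above.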
Infomax principle
[Venn diagram: entropies H(X) and H(Y), their overlap I(X;Y), and the conditional entropies H(X|Y) and H(Y|X).]
H(.) is the entropy; H(Y|X) is the conditional entropy of the label given the features (class-posterior entropy, CPE).
Infomax: max I(Y;X) = min H(Y|X).
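The equivalence on the last line follows from the standard decomposition
\[ I(Y;X) \;=\; H(Y) - H(Y \mid X): \]
H(Y) is fixed by the class priors and does not depend on the chosen features, so maximizing the mutual information over feature subsets is exactly minimizing the CPE.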
Infomax example
Two classes (M = 2), two features x1 and x2.
[Figure: class-conditional densities along x1 and x2; one feature separates the two classes while the other, despite its larger variance, does not.]
Note: variance-based criteria (e.g. PCA) fail in this case!
Infomax vs. BE
To show: Bayes error ≥ infomax bound (BE is lower-bounded in terms of the CPE).
Example
[Figure: the two sides of the bound plotted as functions of a model parameter.]
Important observations:
- the gradients of the two curves have the same sign everywhere they are defined
- the extrema of the two sides are co-located
Hence the LHS and RHS have the same optimization solutions.
Infomax vs. BE
Bayes error ≥ infomax bound.
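The inequality is of the same type as the classical Fano bound, stated here for context in assumed notation (the paper's exact bound may differ in form):
\[ H(Y \mid X) \;\le\; H_b(L^*) + L^* \log(M-1), \]
where H_b is the binary entropy function; rearranged, this lower-bounds the Bayes error L* by an increasing function of the CPE, which is the direction of the claim above.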
Infomax is optimal in the MPE sense! Infomax is a good approximation of BE: the infomax solutions will be very similar to those of BE.
Example: M = 2.
[Figure: BE and CPE, H(Y|X), plotted as functions of the mean µ.]
Discriminant form of infomax
The infomax goal is equivalent to maximizing the separation between the different classes (Theorem 3; see the identity below).
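The identity underlying this discriminant reading, stated in assumed notation (a standard decomposition of mutual information, though not necessarily the paper's exact Theorem 3):
\[ I(X;Y) \;=\; \sum_i P_Y(i)\, D_{KL}\!\big(P_{X|Y}(\cdot \mid i)\,\big\|\,P_X\big), \]
i.e. the mutual information is the average divergence of each class-conditional density from the overall density of the features, which is a direct measure of how separated the classes are.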
Feature Selection (FS)
Forward sequential search for FS: in each step, a set of features is added to the current best subset, with the goal of optimizing a cost function.
Denote the current subset by X_c, the added features by X_a, and the new subset by X* = {X_c, X_a} (notation assumed). One can prove the chain-rule decomposition
\[ I(X^*;Y) \;=\; I(X_c;Y) + I(X_a;Y \mid X_c), \]
or, equivalently, \( I(X_a;Y \mid X_c) = I(X^*;Y) - I(X_c;Y) \).
Maximizing mutual information (infomax) is thus simpler than minimizing BE: the infomax cost decomposes additively across selection steps, while the max(.) in the BE does not.
Proof
The decomposition follows by expanding the definition of mutual information and applying the chain rule of entropy.
Feature Selection (cont'd)
The infomax step cost trades off the maximization of discriminant power against the minimization of redundancy.
Problem: infomax requires high-dimensional density estimates, so we seek a trade-off between optimality and complexity.
The cost favors discriminant features and penalizes features redundant with those already selected, unless the redundancy itself provides information about Y (see the decomposition below).
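The decomposition behind those annotations is the identity (obtained by expanding I(X_a; Y, X_c) with the chain rule in the two possible orders; notation as above):
\[ I(X_a;Y \mid X_c) \;=\; \underbrace{I(X_a;Y)}_{\text{discriminant power}} \;-\; \underbrace{I(X_a;X_c)}_{\text{redundancy penalty}} \;+\; \underbrace{I(X_a;X_c \mid Y)}_{\text{class-informative redundancy}}. \]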
Maximum Marginal Diversity (MMD)
MMD-based FS is a naïve infomax: select the subset of features that leads to a set of maximally diverse marginal densities.
Optimality condition (Lemma): MMD is optimal if the average mutual information between the features is not affected by knowledge of the class label (formalized below).
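In assumed notation (reconstructed from the verbal statement above; the paper's exact formulation may differ), the marginal diversity of a feature is its marginal mutual information with the label,
\[ \mathrm{md}(X_k) \;=\; \sum_i P_Y(i)\, D_{KL}\!\big(P_{X_k|Y}(\cdot \mid i)\,\big\|\,P_{X_k}\big) \;=\; I(X_k;Y), \]
and the lemma's condition amounts to requiring, for the selected features,
\[ I(X_k;\, X_1,\dots,X_{k-1}) \;=\; I(X_k;\, X_1,\dots,X_{k-1} \mid Y) \qquad \forall k, \]
i.e. the dependence among features carries no information about Y, so the conditional terms in the infomax cost cancel and only the marginal terms remain.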
The Naïve Bayes classifier
Assumption: the features are conditionally independent given the class label.
However, the optimality condition for MMD does not hold under this assumption: conditional independence gives I(X_j; X_k | Y) = 0 for every pair of features, so the condition above would force the features to also be marginally independent, I(X_j; X_k) = 0, which is not true in general.
Features selected by MMD are therefore not a good match for the Naïve Bayes classifier!
MMD (continued)
Fortunately, recent studies show that, for image recognition problems, MMD is very close to the optimal solution for biologically plausible features, e.g. wavelet coefficients.
Advantage: computation is simple, since only the marginal distribution of each feature is needed.
Disadvantages: in practice it is hard to verify that the optimality condition holds, and there is no optimality guarantee when it does not.
Image statistics
Feature dependencies tend to be localized across both space and image scale. E.g., for a standard wavelet decomposition:
- co-located coefficients of equal orientation can be arbitrarily dependent on the class
- the average dependence between such sets of coefficients does not depend on the image class (strong vertical frequencies imply weak horizontal frequencies, whatever the class)
This property motivates a condition more general than the one underlying MMD, l-decomposability:
- the feature set decomposes into mutually exclusive subsets of order l
- features within a subset may be arbitrarily dependent, with no constraints
- the dependence between subsets does not depend on the image class
More general case
All features are grouped into a collection of disjoint subsets. The features within each subset are allowed to have arbitrary dependencies, while the dependencies between subsets are constrained to be non-informative about the class.
A family of FS algorithms
l-decomposability, as defined above, indexes a family of feature selection algorithms.
A family of FS algorithms (cont'd)
Theorem: for an l-decomposable feature set, the optimal infomax FS solution only requires density estimates of dimension l + 1.
A family of FS algorithms (cont'd)
The parameter l trades off optimality against complexity:
- small l: sub-optimal but computationally efficient
- l = 0: the MMD case, in which all features may depend on each other only in a non-informative way
- l = n: all features may depend in informative ways; optimal but computationally unscalable
Infomax-based FS algorithm
Greedy forward selection under the l-decomposability assumption (a sketch follows).
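A minimal sketch of the l = 0 member of this family (MMD-style selection), written from the slides' description rather than from the paper's pseudo-code; the function names, the histogram estimator, and the interface are all assumptions:

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Estimate I(X;Y) between a 1-D continuous feature x and integer
    class labels y, using a joint histogram with `bins` feature bins."""
    edges = np.histogram_bin_edges(x, bins=bins)
    xi = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    joint = np.zeros((bins, int(y.max()) + 1))
    np.add.at(joint, (xi, y), 1.0)           # joint counts of (bin, class)
    joint /= joint.sum()                     # joint probabilities
    px = joint.sum(axis=1, keepdims=True)    # feature marginal
    py = joint.sum(axis=0, keepdims=True)    # class marginal (priors)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def select_features(X, y, d, bins=8):
    """MMD feature selection: rank features by marginal diversity
    md(X_k) = I(X_k;Y) and keep the top d. X is (n_samples, n_features)."""
    md = [histogram_mi(X[:, k], y, bins) for k in range(X.shape[1])]
    return np.argsort(md)[::-1][:d]
```

For l > 0, the marginal score I(X_k;Y) would be replaced by a conditional score I(X_k;Y | neighborhood), with the neighborhood restricted to at most l previously selected features so that, per the theorem above, only (l+1)-dimensional histograms are ever needed.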
Algorithm complexity
Suppose C classes, F feature vectors per class, and histograms with b bins along each axis.
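A back-of-envelope count under these assumptions (my estimate from the stated setup, not necessarily the slide's exact expression): an (l+1)-dimensional histogram has b^{l+1} cells, so scoring one candidate feature costs O(CF) operations to accumulate the class-conditional histograms plus O(C b^{l+1}) to evaluate the mutual information, i.e. roughly O(n(CF + C b^{l+1})) per greedy step over n candidates. This makes explicit why small l (and in particular l = 0, MMD) scales well, while l = n is unscalable.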
Experiments on MMD (1)
A simple example in which the optimal feature subsets are known: two Gaussian classes of identity covariance and distinct means, with n = 20. The "feature selection quality" is the ratio of correctly selected features to n; in this example, the optimality condition of MMD is satisfied. The average feature-selection quality is compared with Jain & Zongker's results (Mahalanobis distance).
[Figure: average quality vs. number of training samples for MMD, branch and bound, and SFS; higher is better.]
Experiments on MMD (2)
Brodatz texture classification: 112 texture classes, a 64-dimensional (8×8) feature space, and classifiers based on Gaussian mixtures.
[Figure: cumulative MD and classification accuracy vs. number of features.]
Experiments on MMD (3)
Image retrieval on the Brodatz texture database. PRA: area under the precision/recall curve.
[Figure: PRA and MD vs. number of features.]
Experiments on MMD (4)
Features as filters: the projections of the textures onto the five most informative basis functions act as detectors of lines, corners, T-junctions, and so forth.
Experiment on infomax (1)
Image retrieval on the Corel image database (15 classes, 1500 images), for different cluster sizes (l = 0, 1, 2) and a variance-based baseline.
Main observations:
- ITFS can significantly outperform variance-based methods (10 vs. 30 features for equivalent PRA)
- for ITFS there is no noticeable gain for l > 1!
[Figure: PRA vs. number of features for l = 0, 1, 2 and the variance criterion.]
Experiment on infomax (2)
Different numbers of histogram bins.
Main observations:
- infomax-based FS is quite insensitive to the quality of the density estimates: no noticeable variation above 8 bins per axis, and only a small degradation at 4
- it is always significantly better than the variance criterion
[Figure: PRA vs. number of features for the different bin counts.]
Experiment on infomax (3)
Image retrieval results on Corel.
Conclusion
- Infomax-based feature selection is optimal in the MPE sense.
- An explicit understanding of the trade-off between optimality and complexity, and of the corresponding optimality condition implied by infomax (the most important contribution).
- A scalable infomax-based FS algorithm for image retrieval and recognition.
Future work: evaluation of the optimality and efficiency of this infomax-based algorithm on other features (such as the rectangular features in Viola & Jones' face detector) and other classification problems.