Feature Selection for Image Retrieval and Object Recognition
Nuno Vasconcelos et al.
Statistical Visual Computing Lab, ECE, UCSD
Presented by Dashan Gao
Scalable Discriminant Feature Selection for Image Retrieval and Recognition, N. Vasconcelos and M. Vasconcelos, to appear in IEEE CVPR, 2004.
Feature Selection by Maximum Marginal Diversity: Optimality and Implications for Visual Recognition, N. Vasconcelos, Proceedings of IEEE CVPR, 2003.
Feature Selection by Maximum Marginal Diversity, N. Vasconcelos, Proceedings of Neural Information Processing Systems (NIPS), 2002.
Overview (1)
Image retrieval is a large-scale classification problem:
- a large number of classes, and large amounts of data per class
- a discriminant feature space of small dimensionality is a prerequisite for success
Feature selection (FS) makes learning easier and tractable by working in a lower-dimensional feature space X.
Goal: find a transformation T, constrained to be a subset projection, i.e. find the projection matrix T that optimizes a criterion of "feature goodness" (formalized below).
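A minimal formalization of the subset-projection constraint (the notation here is assumed, not quoted from the paper):
\[ \mathbf{z} = T\mathbf{x}, \qquad \mathbf{x} \in \mathbb{R}^n,\ \mathbf{z} \in \mathbb{R}^d,\ d < n, \]
where the rows of the d-by-n matrix T are d distinct rows of the identity matrix, so that z simply keeps a subset of the coordinates of x; FS then solves \( T^* = \arg\max_T J(T) \) for a feature-goodness criterion J.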
Overview (2)
Weaknesses of traditional methods:
- based on sub-optimal criteria, e.g. variance maximization (principal component analysis, PCA)
- lack of scalability: they take infeasible time to compute
- difficult to extend to multi-class problems (e.g. boosting)
Ultimate goal: minimum probability of error (MPE); search for the Bayes-error-optimal space of a given classification problem.
Achievable goal (discriminant sense): maximize the separation between the classes to be recognized.
Information-theoretic feature selection (ITFS)
Infomax goal: maximize the mutual information between the selected features and the class labels.
Outline:
- optimality properties, in the MPE and discriminant senses (Contribution 1)
- trade-off between optimality and complexity (Contribution 2)
- an algorithmic implementation with low complexity
Bayes Error (BE)
Advantage: BE depends only on the feature space, and is thus the ultimate discriminant measure for FS.
Disadvantage: the nonlinearity of the max(.) operation makes it hard to optimize.
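For reference, the Bayes error on a feature space (standard textbook definition, stated in assumed notation):
\[ L^* \;=\; E_{\mathbf{x}}\Big[\,1 - \max_i P_{Y|\mathbf{X}}(i \mid \mathbf{x})\Big], \]
with class label Y ∈ {1, …, M}; the max(.) inside the expectation is the nonlinearity referred to above.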
Infomax principle
[Venn diagram: entropies H(X) and H(Y), their overlap I(X;Y), and the conditional entropies H(X|Y) and H(Y|X).]
H(.) is the entropy; H(Y|X) is the conditional entropy of the label given the features (class-posterior entropy, CPE).
Infomax: max I(Y;X) = min H(Y|X).
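The equivalence on the last line follows from the standard decomposition
\[ I(Y;X) \;=\; H(Y) - H(Y \mid X): \]
H(Y) is fixed by the class priors and does not depend on the chosen features, so maximizing the mutual information over feature subsets is exactly minimizing the CPE.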
Infomax example
Two classes (M = 2), two features x1 and x2.
[Figure: class-conditional densities along x1 and x2; one feature separates the two classes while the other, despite its larger variance, does not.]
Note: variance-based criteria (e.g. PCA) fail in this case!
Infomax vs. BE
To show: Bayes error ≥ infomax bound (BE is lower-bounded in terms of the CPE).
Example
[Figure: the two sides of the bound plotted as functions of a model parameter.]
Important observations:
- the gradients of the two curves have the same sign everywhere they are defined
- the extrema of the two sides are co-located
Hence the LHS and RHS have the same optimization solutions.
Infomax vs. BE
Bayes error ≥ infomax bound.
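The inequality is of the same type as the classical Fano bound, stated here for context in assumed notation (the paper's exact bound may differ in form):
\[ H(Y \mid X) \;\le\; H_b(L^*) + L^* \log(M-1), \]
where H_b is the binary entropy function; rearranged, this lower-bounds the Bayes error L* by an increasing function of the CPE, which is the direction of the claim above.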
Infomax is optimal in the MPE sense! Infomax is a good approximation of BE: the infomax solutions will be very similar to those of BE.
Example: M = 2.
[Figure: BE and CPE, H(Y|X), plotted as functions of the mean µ.]
Discriminant form of infomax
The infomax goal is equivalent to maximizing the separation between the different classes (Theorem 3; see the identity below).
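The identity underlying this discriminant reading, stated in assumed notation (a standard decomposition of mutual information, though not necessarily the paper's exact Theorem 3):
\[ I(X;Y) \;=\; \sum_i P_Y(i)\, D_{KL}\!\big(P_{X|Y}(\cdot \mid i)\,\big\|\,P_X\big), \]
i.e. the mutual information is the average divergence of each class-conditional density from the overall density of the features, which is a direct measure of how separated the classes are.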
Feature Selection (FS)
Forward sequential search for FS: in each step, a set of features is added to the current best subset, with the goal of optimizing a cost function.
Denote the current subset by X_c, the added features by X_a, and the new subset by X* = {X_c, X_a} (notation assumed). One can prove the chain-rule decomposition
\[ I(X^*;Y) \;=\; I(X_c;Y) + I(X_a;Y \mid X_c), \]
or, equivalently, \( I(X_a;Y \mid X_c) = I(X^*;Y) - I(X_c;Y) \).
Maximizing mutual information (infomax) is thus simpler than minimizing BE: the infomax cost decomposes additively across selection steps, while the max(.) in the BE does not.
Proof
The decomposition follows by expanding the definition of mutual information and applying the chain rule of entropy.
Feature Selection (cont'd)
The infomax step cost trades off the maximization of discriminant power against the minimization of redundancy.
Problem: infomax requires high-dimensional density estimates, so we seek a trade-off between optimality and complexity.
The cost favors discriminant features and penalizes features redundant with those already selected, unless the redundancy itself provides information about Y (see the decomposition below).
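The decomposition behind those annotations is the identity (obtained by expanding I(X_a; Y, X_c) with the chain rule in the two possible orders; notation as above):
\[ I(X_a;Y \mid X_c) \;=\; \underbrace{I(X_a;Y)}_{\text{discriminant power}} \;-\; \underbrace{I(X_a;X_c)}_{\text{redundancy penalty}} \;+\; \underbrace{I(X_a;X_c \mid Y)}_{\text{class-informative redundancy}}. \]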
Maximum Marginal Diversity (MMD)
MMD-based FS is a naïve infomax: select the subset of features that leads to a set of maximally diverse marginal densities.
Optimality condition (Lemma): MMD is optimal if the average mutual information between the features is not affected by knowledge of the class label (formalized below).
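In assumed notation (reconstructed from the verbal statement above; the paper's exact formulation may differ), the marginal diversity of a feature is its marginal mutual information with the label,
\[ \mathrm{md}(X_k) \;=\; \sum_i P_Y(i)\, D_{KL}\!\big(P_{X_k|Y}(\cdot \mid i)\,\big\|\,P_{X_k}\big) \;=\; I(X_k;Y), \]
and the lemma's condition amounts to requiring, for the selected features,
\[ I(X_k;\, X_1,\dots,X_{k-1}) \;=\; I(X_k;\, X_1,\dots,X_{k-1} \mid Y) \qquad \forall k, \]
i.e. the dependence among features carries no information about Y, so the conditional terms in the infomax cost cancel and only the marginal terms remain.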
The Naïve Bayes classifier
Assumption: the features are conditionally independent given the class label.
However, the optimality condition for MMD does not hold under this assumption: conditional independence gives I(X_j; X_k | Y) = 0 for every pair of features, so the condition above would force the features to also be marginally independent, I(X_j; X_k) = 0, which is not true in general.
Features selected by MMD are therefore not a good match for the Naïve Bayes classifier!
MMD (continued)
Fortunately, recent studies show that, for image recognition problems, MMD is very close to the optimal solution for biologically plausible features, e.g. wavelet coefficients.
Advantage: computation is simple, since only the marginal distribution of each feature is needed.
Disadvantages: in practice it is hard to verify that the optimality condition holds, and there is no optimality guarantee when it does not.
Image statistics
Feature dependencies tend to be localized across both space and image scale. E.g., for a standard wavelet decomposition:
- co-located coefficients of equal orientation can be arbitrarily dependent on the class
- the average dependence between such sets of coefficients does not depend on the image class (strong vertical frequencies imply weak horizontal frequencies, whatever the class)
This property motivates a condition more general than the one underlying MMD, l-decomposability:
- the feature set decomposes into mutually exclusive subsets of order l
- features within a subset may be arbitrarily dependent, with no constraints
- the dependence between subsets does not depend on the image class
More general case
All features are grouped into a collection of disjoint subsets. The features within each subset are allowed to have arbitrary dependencies, while the dependencies between subsets are constrained to be non-informative about the class.
A family of FS algorithms
l-decomposability, as defined above, indexes a family of feature selection algorithms.
A family of FS algorithms (cont'd)
Theorem: for an l-decomposable feature set, the optimal infomax FS solution only requires density estimates of dimension l + 1.
A family of FS algorithms (cont'd)
The parameter l trades off optimality against complexity:
- small l: sub-optimal but computationally efficient
- l = 0: the MMD case, in which all features may depend on each other only in a non-informative way
- l = n: all features may depend in informative ways; optimal but computationally unscalable
Infomax-based FS algorithm
Greedy forward selection under the l-decomposability assumption (a sketch follows).
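A minimal sketch of the l = 0 member of this family (MMD-style selection), written from the slides' description rather than from the paper's pseudo-code; the function names, the histogram estimator, and the interface are all assumptions:

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Estimate I(X;Y) between a 1-D continuous feature x and integer
    class labels y, using a joint histogram with `bins` feature bins."""
    edges = np.histogram_bin_edges(x, bins=bins)
    xi = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    joint = np.zeros((bins, int(y.max()) + 1))
    np.add.at(joint, (xi, y), 1.0)           # joint counts of (bin, class)
    joint /= joint.sum()                     # joint probabilities
    px = joint.sum(axis=1, keepdims=True)    # feature marginal
    py = joint.sum(axis=0, keepdims=True)    # class marginal (priors)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def select_features(X, y, d, bins=8):
    """MMD feature selection: rank features by marginal diversity
    md(X_k) = I(X_k;Y) and keep the top d. X is (n_samples, n_features)."""
    md = [histogram_mi(X[:, k], y, bins) for k in range(X.shape[1])]
    return np.argsort(md)[::-1][:d]
```

For l > 0, the marginal score I(X_k;Y) would be replaced by a conditional score I(X_k;Y | neighborhood), with the neighborhood restricted to at most l previously selected features so that, per the theorem above, only (l+1)-dimensional histograms are ever needed.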
Algorithm complexity
Suppose C classes, F feature vectors per class, and histograms with b bins along each axis.
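A back-of-envelope count under these assumptions (my estimate from the stated setup, not necessarily the slide's exact expression): an (l+1)-dimensional histogram has b^{l+1} cells, so scoring one candidate feature costs O(CF) operations to accumulate the class-conditional histograms plus O(C b^{l+1}) to evaluate the mutual information, i.e. roughly O(n(CF + C b^{l+1})) per greedy step over n candidates. This makes explicit why small l (and in particular l = 0, MMD) scales well, while l = n is unscalable.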
Experiments on MMD (1)
A simple example in which the optimal feature subsets are known: two Gaussian classes of identity covariance and distinct means, with n = 20. The "feature selection quality" is the ratio of correctly selected features to n; in this example, the optimality condition of MMD is satisfied. The average feature-selection quality is compared with Jain & Zongker's results (Mahalanobis distance).
[Figure: average quality vs. number of training samples for MMD, branch and bound, and SFS; higher is better.]
Experiments on MMD (2)
Brodatz texture classification: 112 texture classes, a 64-dimensional (8×8) feature space, and classifiers based on Gaussian mixtures.
[Figure: cumulative MD and classification accuracy vs. number of features.]
Experiments on MMD (3)
Image retrieval on the Brodatz texture database. PRA: area under the precision/recall curve.
[Figure: PRA and MD vs. number of features.]
Experiments on MMD (4)
Features as filters: the projections of the textures onto the five most informative basis functions act as detectors of lines, corners, T-junctions, and so forth.
Experiment on infomax (1)
Image retrieval on the Corel image database (15 classes, 1500 images), for different cluster sizes (l = 0, 1, 2) and a variance-based baseline.
Main observations:
- ITFS can significantly outperform variance-based methods (10 vs. 30 features for equivalent PRA)
- for ITFS there is no noticeable gain for l > 1!
[Figure: PRA vs. number of features for l = 0, 1, 2 and the variance criterion.]
Experiment on infomax (2)
Different numbers of histogram bins.
Main observations:
- infomax-based FS is quite insensitive to the quality of the density estimates: no noticeable variation above 8 bins per axis, and only a small degradation at 4
- it is always significantly better than the variance criterion
[Figure: PRA vs. number of features for the different bin counts.]
Experiment on infomax (3)
Image retrieval results on Corel.
Conclusion
- Infomax-based feature selection is optimal in the MPE sense.
- An explicit understanding of the trade-off between optimality and complexity, and of the corresponding optimality condition implied by infomax (the most important contribution).
- A scalable infomax-based FS algorithm for image retrieval and recognition.
Future work: evaluation of the optimality and efficiency of this infomax-based algorithm on other features (such as the rectangular features in Viola & Jones' face detector) and other classification problems.