TOWARDS LARGE-SCALE FINE-GRAINED CATEGORY LEARNING:
POSE NORMALIZATION, HIERARCHICAL GENERALIZATION, AND TIMELY DETECTION
TREVOR DARRELL UC BERKELEY EECS & ICSI
RYAN FARRELL, NING ZHANG, YANGQING JIA, JOSHUA ABBOTT, JOSEPH AUSTERWEIL, TOM GRIFFITHS, SERGEY KARAYEV, TOBI BAUMGARTNER, MARIO FRITZ
CATEGORIZATION SPECTRUM
CALTECH 101 (2004) Fei-Fei, Fergus, Perona
CALTECH 256 (2007) Griffin, Holub, Perona
Animals with Attributes (2009) Lampert, Nickisch, Harmeling, Weidmann
Exploration of Methods for Automatic Whale Identification (2001) Heusel, Gunn
Identifying Individual Salamanders (2007) Gamble, Ravela, McGarigal
Labeled Faces in the Wild (2007) Huang, Ramesh, Berg, Learned-Miller
STONEFLY9 (2009) Martínez-Muñoz, Zhang, Payet, Todorovic, Larios, Yamamuro, Lytle, Moldenke, Mortensen, Paasch, Shapiro, Dietterich
Caltech/UCSD Birds 200 (2010) Welinder, Branson, Mita, Wah, Schroff, Belongie, Perona
Visual Identification of Plant Species (2008) Belhumeur, Chen, Feiner, Jacobs, Kress, Ling, Lopez, Ramamoorthi, Sheorey, White, Zhang
Oxford Flowers (2006-09) Nilsback, Zisserman
Fine-grained recognition bridges traditional instance and category-level tasks...
TOWARDS LARGE-SCALE FINE-GRAINED RECOGNITION
Tasks are relatively difficult even for humans: bird subspecies, vehicle brands, etc.
People learn them from small numbers of positive examples
Both local feature geometry and statistical appearance are salient.
Key differences vs. traditional recognition:
– Distinctive fine-grained features are often relative to object configuration
– Degree of generalization varies across category hierarchies
– Finest-grained categories may have few training examples
– Can’t detect everything all the time in a large-scale setting
TOWARDS LARGE-SCALE FINE-GRAINED RECOGNITION
Today:
• Pose Pooling Kernels for Sub-category Recognition
• Generalization in Large-Scale Concept Hierarchies
• Timely Recognition
POSE POOLING KERNELS FOR SUB-CATEGORY RECOGNITION
NING ZHANG, RYAN FARRELL, TREVOR DARRELL CVPR 2012
FINE-GRAINED VISUAL CATEGORIZATION
DISCRIMINATIVE DETAILS MAY BE HIGHLY LOCALIZED AND POSE RELATIVE
Scarlet Tanager Photo by Paul O’Toole
Summer Tanager Photo by Liam Wolff
Summer Tanager Photo by Patti Shoupe
Snowy Egret Photo by Shelley Rutkin
Great Egret Photo by James Hawkins
“BIRDLETS” [ICCV11] APPROACH
“Blue-Headed Vireo”!
DETECTION OF VOLUMETRIC PRIMITIVES
POSE-NORMALIZED APPEARANCE SPACE
2D VS. 3D
3D Representation
Pros
• Most accurate representation for volumetric objects
• Facilitates pose-normalized appearance model
Cons
• Accurate detection is challenging
• Annotations are costly
• Volumetric part model doesn’t capture flat parts such as wings
2D Representation
Pros
• Far easier to train (less complex and less costly annotations)
• Far more tolerant of localization errors
Cons
• Less precise part localization can attenuate key discriminative features
A classic story: recognition via detection
faces → detection / part localization → alignment / normalization → identification / attribute learning
Traditional Spatial Pooling
“POSE DEPENDENT” POOLING
OVERVIEW
“Red-Eyed Vireo”
Pose-Normalized Representation
Annotated images
Poselet training
Poselet activations
SVM classification
POSELET TRAINING
Parts: back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, throat
Collect positive examples from other
training images
POSELET TRAINING
Bourdev and Malik, in ICCV 2009.
POSELET ACTIVATION (PREDICTION)
Poselet #90
Poselet #66
Poselet #65
COMPARING POSELETS
• Local image features from poselets which overlap in 3D are pooled together
POSE NORMALIZED REPRESENTATION
• Cluster based on warping distance or keypoint distance
POSELET CLUSTERS
Pose pooling vs. spatial pyramid
Pose pooling
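The contrast between the two pooling schemes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes local features are already quantized to visual-word indices with (x, y) positions, and that each poselet activation arrives as a (cluster id, box) pair. The function names, the toy data, and the box format are all hypothetical.

```python
import numpy as np

def spatial_grid_histograms(words, xy, box, n_words, grid=2):
    """Spatial-pyramid-style pooling: one histogram per cell of a fixed grid on the box."""
    x0, y0, x1, y1 = box
    hists = np.zeros((grid, grid, n_words))
    for w, (x, y) in zip(words, xy):
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue
        cx = min(int((x - x0) / (x1 - x0) * grid), grid - 1)
        cy = min(int((y - y0) / (y1 - y0) * grid), grid - 1)
        hists[cy, cx, w] += 1
    return hists.reshape(-1)

def pose_pooled_histograms(words, xy, activations, n_clusters, n_words):
    """Pose pooling: one histogram per poselet cluster, counting the features that
    fall inside any activated poselet box assigned to that cluster."""
    hists = np.zeros((n_clusters, n_words))
    for cluster_id, (x0, y0, x1, y1) in activations:
        for w, (x, y) in zip(words, xy):
            if x0 <= x < x1 and y0 <= y < y1:
                hists[cluster_id, w] += 1
    return hists.reshape(-1)

# Toy data (hypothetical): four quantized features and three poselet activations.
words = np.array([0, 1, 1, 2])
xy = np.array([[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0]])
activations = [(0, (0, 0, 2, 2)), (1, (2, 2, 4, 4)), (0, (0, 2, 2, 4))]

grid_vec = spatial_grid_histograms(words, xy, (0, 0, 4, 4), n_words=3)
pose_vec = pose_pooled_histograms(words, xy, activations, n_clusters=2, n_words=3)
```

In the grid version a feature's bin depends only on where it sits in the bounding box; in the pose-pooled version it depends on which part-like region fired there, so two activations of the same cluster in different image locations still contribute to the same histogram.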
EXPERIMENTAL RESULTS
P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona. CalTech/UCSD, 2010. http://www.vision.caltech.edu/visipedia/CUB-200.html
• Implementation Details
– Poselet Activations
• Activations are generated by comparing each poselet’s keypoint distributions with the annotated keypoint locations on each image.
– Appearance Descriptor
• Bag of words on SIFT features / Spatial Pyramid Match on top of SIFT.
– SVM classifier
• Linear SVM / fast implementation of χ² and intersection kernels.
– Baseline methods
• VLFeat toolbox SIFT on the object bounding box, without pose-normalized information.
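For reference, the two histogram kernels named above can be written directly in NumPy. This is a plain, unoptimized sketch, not the fast implementation the slide refers to. Several χ² kernel variants exist; the additive form used here is one common choice and is an assumption, as are the function names and the `eps` guard.

```python
import numpy as np

def intersection_kernel(X, Y):
    """K[i, j] = sum_k min(X[i, k], Y[j, k]) for non-negative histograms."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def chi2_kernel(X, Y, eps=1e-12):
    """Additive chi-squared kernel: K[i, j] = sum_k 2 * x_k * y_k / (x_k + y_k)."""
    num = 2.0 * X[:, None, :] * Y[None, :, :]
    den = X[:, None, :] + Y[None, :, :] + eps  # eps avoids 0/0 on empty bins
    return (num / den).sum(axis=2)

# Two L1-normalized toy histograms (hypothetical data).
hists = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]])
K_int = intersection_kernel(hists, hists)
K_chi2 = chi2_kernel(hists, hists)
```

Either Gram matrix can then be handed to an SVM solver that accepts precomputed kernels.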
CATEGORIZATION RESULTS
Baseline - VLFeat 29.73% (MAP)
Warped Feature 36.33% (MAP)
Pose Pooling Kernel 40.60% (MAP)
Confusion matrices on 14 categories across two bird families {vireos, woodpeckers}.
MORE RESULTS
FUTURE WORK
• Detection
• Better descriptors
– Physically inspired color representations
– Salience cues to reduce false positive poselet responses
• Experiments on larger datasets…
Caltech UCSD Berkeley
For more information contact: Ryan Farrell - [email protected]
“Sweet” Taster-25 (released June 2012)
“Bitter” Sparrows-33 (will be released this Fall with the full
700+ category dataset)
Bounding Box
Segmentation Mask
Part Keypoints
Part Regions
Attributes
Among the Key Innovations
• Photos collected via enthusiast photographer submissions and annotated by citizen scientists
• Expert-curated collection (category for each image is verified by a domain expert)
• Deep taxonomy and tree-based distance measures
The dataset is being collected with tools designed for re-use in other fine-grained domains
D( , ) << D( , )
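One simple tree-based distance is the number of edges between two categories through their lowest common ancestor. The sketch below assumes that definition, which the slide does not spell out. The grouping of species into "tanagers" and "egrets" under "birds" is a hypothetical toy taxonomy built from the species names on the earlier slides.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges on the path between a and b through their lowest common ancestor."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:
            return depth_from_a[n] + j
    raise ValueError("nodes are not in the same tree")

# Hypothetical toy taxonomy: child -> parent.
parent = {
    "Scarlet Tanager": "tanagers", "Summer Tanager": "tanagers",
    "Snowy Egret": "egrets", "Great Egret": "egrets",
    "tanagers": "birds", "egrets": "birds",
}
d_near = tree_distance("Scarlet Tanager", "Summer Tanager", parent)
d_far = tree_distance("Scarlet Tanager", "Snowy Egret", parent)
```

With this toy tree, two tanagers are closer to each other than a tanager is to an egret, which is the kind of inequality the slide depicts.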
Today:
• Pose Pooling Kernels for Sub-category Recognition
• Generalization in Large-Scale Concept Hierarchies
• Timely Recognition
BAYESIAN CONCEPT GENERALIZATION
Yangqing Jia, Joshua Abbott, Joe Austerweil,
Tom Griffiths, Trevor Darrell
TOWARDS LARGE-SCALE FINE-GRAINED CATEGORY RECOGNITION
Current ML algorithms:
– hundreds of positive examples
– thousands of negative examples
– same learning method throughout hierarchy
Humans:
– can learn from one, two, or three examples
– often no negative examples
– more generalization for broader concepts (“size principle”)
LEARNING A NOUN “blicket” “blicket”
“blicket”
BAYESIAN INFERENCE
p(h|d) = p(d|h) p(h) / Σ_h' p(d|h') p(h')
Posterior probability: p(h|d)
Likelihood: p(d|h)
Prior probability: p(h)
The denominator sums over the space of hypotheses
h: hypothesis
d: data
LEARNING NOUNS
• Data – object-word pairs
• Hypotheses – sets of objects
• Likelihood – weak sampling
– strong sampling
(Tenenbaum, 1999; Xu & Tenenbaum, 2007)
[Figure: graphical models relating h, x, and w under weak and strong sampling]
[Figure: likelihood of “blicket” examples under candidate hypotheses]
p(d|h) = 0
p(d|h) = 1/3
p(d|h) = (1/3)^3
p(d|h) = 1/12
p(d|h) = (1/12)^3
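The likelihood values above can be reproduced with a small sketch of Bayesian inference over two nested hypotheses. The sizes 3 and 12 follow the slide values; the object names, the uniform prior, and the function names are assumptions made for illustration. Exact fractions are used so the numbers can be read off directly.

```python
from fractions import Fraction

def likelihood(examples, hypothesis, strong=True):
    """p(d|h) for examples d and a hypothesis h given as a set of objects."""
    if any(x not in hypothesis for x in examples):
        return Fraction(0)  # some example lies outside h
    if strong:
        # Strong sampling: each example is drawn uniformly from h.
        return Fraction(1, len(hypothesis)) ** len(examples)
    return Fraction(1)  # weak sampling: any consistent h explains the data equally well

def posterior(examples, hypotheses, prior, strong=True):
    """Bayes' rule over a finite hypothesis space (at least one h must be consistent)."""
    scores = {name: prior[name] * likelihood(examples, h, strong)
              for name, h in hypotheses.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# Hypothetical nested hypotheses with 3 and 12 members.
small = {"a1", "a2", "a3"}
large = small | {f"b{i}" for i in range(9)}
hypotheses = {"small": small, "large": large}
prior = {"small": Fraction(1, 2), "large": Fraction(1, 2)}

examples = ["a1", "a2", "a3"]
post_strong = posterior(examples, hypotheses, prior, strong=True)
post_weak = posterior(examples, hypotheses, prior, strong=False)
```

Under strong sampling, three examples that all fall in the smaller set favor it by (1/3)^3 / (1/12)^3 = 64 to 1, so its posterior is 64/65. Under weak sampling the two hypotheses stay tied at 1/2. This is the size principle: narrower hypotheses gain support quickly as consistent examples accumulate.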
PREVIOUS WORK (XU & TENENBAUM, 2007)
[Figure: three levels: Principles, Hypotheses, Data]
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
PREVIOUS RESULTS (SMALL-SCALE)
Challenges
• Small, hand-constructed domains
• Toy stimuli
• Constructing the hypothesis space from pairwise similarity judgments requires O(n^2) judgments
LARGE-SCALE WORD LEARNING
Solutions
• Hypothesis space is automatically derived from WordNet structure
Here are three BLICKETS
Here are five FEPS
Here are four ZIVS
Is this a BLICKET?
Is this a ZIV?
Is this a FEP?
Here are three FEPS
Can you help Mr. Frog find the other “FEPS”?
LARGE SCALE MODEL RESULTS
[Abbott, Austerweil, and Griffiths, Constructing a hypothesis space from the web for large scale Bayesian word learning, COGSCI 2012]
GENERALIZING TO NEW IMAGES
• Bayesian word learning offers advantages over standard machine learning approaches.
• The first challenge, scaling, has been solved (going from 45 to 14 million objects), but only using ImageNet images and their location in the hierarchy.
• A recent extension of the model incorporates direct pixel observations and models noisy recognition; it generalizes from arbitrary objects and better predicts human data.
Task
renuzit air fresher
coke-can
pringles
pasta-box
book-robotics
tea-box
leaf-node classes from cropped PR2 camera
images…
Image Classification Pipeline
• A convolutional neural network (CNN) pipeline is well suited to this type of data ([Jia et al CVPR12])
local feature coding
spatial pooling
SVM classification
“pasta box”
Densely-coded Local Features
• We extract overlapping local image patches
Encoding Local Features
local features
dictionary (learned in an unsupervised
fashion)
Encoded activation maps
Spatial Pooling
• Take max on a regular grid (or learned spatial bins [cvpr12])
• Convert to a feature vector
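The pooling step can be sketched as follows, assuming the encoded activation maps are stored as an array of shape (height, width, dictionary size). The function name and the random test data are hypothetical, and only the regular-grid variant is shown, not the learned spatial bins.

```python
import numpy as np

def grid_max_pool(activations, grid=2):
    """Max-pool (H, W, K) activation maps over a grid x grid partition and
    flatten the result into a feature vector of length grid * grid * K.
    Assumes H >= grid and W >= grid so that no cell is empty."""
    H, W, K = activations.shape
    rows = np.array_split(np.arange(H), grid)
    cols = np.array_split(np.arange(W), grid)
    pooled = np.empty((grid, grid, K))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            cell = activations[r[0]:r[-1] + 1, c[0]:c[-1] + 1]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled.reshape(-1)

rng = np.random.default_rng(0)
maps = rng.random((6, 8, 5))        # hypothetical encoded activation maps
feature = grid_max_pool(maps, grid=2)
```

The resulting vector is what a linear classifier such as an SVM would consume in the final stage of the pipeline.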
The whole pipeline
“pasta box”
Learning from Examples
• “Dear Robot, here are some feps”
• “Now get me more feps / are these also feps?”
(pasta-box) (peanut-butter) (spam)
Learning from Examples
• We assume a hierarchy of hypotheses from WordNet or provided in a robot’s environment.
• Each hypothesis defines a subset of the leaf nodes (instances) that belong to it:
These are feps
This is probably not a fep
These are not feps
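Deriving the hypothesis space from a hierarchy can be sketched as follows: every node becomes a hypothesis whose extension is the set of leaf nodes below it. The toy hierarchy is hypothetical (the "grocery" and "object" nodes are invented for illustration; the leaf names are the class labels from the earlier slide), and the function name is an assumption.

```python
def leaf_sets(children, root):
    """Map every node in the tree to the set of leaf nodes below it.
    A leaf maps to the singleton set containing itself."""
    sets = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            sets[node] = {node}
        else:
            sets[node] = set().union(*(visit(k) for k in kids))
        return sets[node]

    visit(root)
    return sets

# Hypothetical hierarchy: parent -> list of children.
children = {
    "object": ["grocery", "book-robotics"],
    "grocery": ["pasta-box", "pringles", "coke-can", "tea-box"],
}
hypotheses = leaf_sets(children, "object")
```

A tree in which every internal node has at least two children yields at most 2n - 1 hypotheses for n leaves, so no pairwise similarity judgments are needed to build the space.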
Visually Grounded Word Learning at a Large Scale
• We tested this model at a large scale with ImageNet
[Jia, Abbott, Austerweil, Griffiths, Darrell. 2012]
Human Behavior
[Figure: generalization from more specific to more general concepts, for more specific to more general queries]
Result Comparisons
[Figure panels: Human Behavior; Our Model; Classical Concept Learning (without vision); Classical Vision (without concept learning)]
Video
Today:
• Pose Pooling Kernels for Sub-category Recognition
• Generalization in Large-Scale Concept Hierarchies
• Timely Recognition
Timely Object Recognition
Sergey Karayev, Tobi Baumgartner, Mario Fritz, Trevor Darrell NIPS 2012
Lots of classes and images in a large scale setting… Potentially different class values… Not enough time to run all object detectors…
Our Solution: Dynamic policy for selecting detectors
Recent Related Work
Vijayanarasimhan and Kapoor. Visual Recognition and Detection Under Bounded Computational Resources. CVPR 2010.
Gao and Koller. Active Classification based on Value of Classifier. NIPS 2011.
New Metric: Performance (AP) vs. Time
Area under the AP vs. T curve between start and deadline times
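A minimal sketch of this metric, assuming AP has been measured at a few time points and is linearly interpolated between them. The interpolation, the optional normalization by the length of the interval, and the sample numbers are assumptions, not details given on the slide.

```python
import numpy as np

def ap_vs_time_area(times, aps, t_start, t_deadline, normalize=True):
    """Trapezoidal area under the AP vs. time curve on [t_start, t_deadline].
    `times` must be increasing; AP is linearly interpolated between samples."""
    grid = np.union1d(times, [t_start, t_deadline])
    grid = grid[(grid >= t_start) & (grid <= t_deadline)]
    ap_on_grid = np.interp(grid, times, aps)
    area = float(np.sum(0.5 * (ap_on_grid[1:] + ap_on_grid[:-1]) * np.diff(grid)))
    return area / (t_deadline - t_start) if normalize else area

# Hypothetical measurements: AP after 0, 5, 10, and 20 seconds of computation.
times = [0.0, 5.0, 10.0, 20.0]
aps = [0.0, 0.2, 0.3, 0.4]
score_full = ap_vs_time_area(times, aps, t_start=0.0, t_deadline=20.0)
score_early = ap_vs_time_area(times, aps, t_start=0.0, t_deadline=10.0)
```

A policy that reaches a good AP early scores higher under this measure than one that reaches the same final AP only at the deadline.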
Sequential Detection
[Figure: loop unrolled over time for one image: Belief State → Action → Observations → Belief State → Action → etc.]
• Action selection: maximize expected value
• Execute action (“black box”) and receive observations
• Belief state update with observations, leveraging context
Actions
- scene context or object detector actions
- run on the whole image
- generate observations:
- list of detections
- GIST feature, etc.
Belief state update
- Class presence probabilities inferred from observations
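The loop described on these slides (select an action, execute it as a black box, receive observations, update the belief state) can be sketched as below. This is an illustration under stated assumptions, not the authors' system: the value function and belief update are supplied by the caller, selection is a one-step maximization of that value, and all names, costs, and observation values are hypothetical.

```python
def run_episode(image, actions, value_fn, update_belief, prior, deadline):
    """actions: dict name -> (cost, run), where run(image) returns observations.
    Repeatedly picks the affordable action with the highest value until the
    deadline leaves no affordable action."""
    belief, t, taken = dict(prior), 0.0, []
    remaining = dict(actions)
    while remaining:
        affordable = {n: a for n, a in remaining.items() if t + a[0] <= deadline}
        if not affordable:
            break
        # Action selection: maximize the (caller-supplied) expected value.
        name = max(affordable, key=lambda n: value_fn(belief, n, deadline - t))
        cost, run = remaining.pop(name)
        observations = run(image)                            # black-box execution
        belief = update_belief(belief, name, observations)   # belief state update
        t += cost
        taken.append(name)
    return belief, taken

# Hypothetical actions: a cheap scene-context action and two detectors.
actions = {
    "gist": (1.0, lambda img: {"dog": 0.8, "car": 0.1}),
    "dog_detector": (5.0, lambda img: {"dog": 0.95}),
    "car_detector": (5.0, lambda img: {"car": 0.05}),
}

def value_fn(belief, name, time_left):
    # Toy value: run context first, then detectors for classes believed present.
    return 1.0 if name == "gist" else belief[name.split("_")[0]]

def update_belief(belief, name, observations):
    # Toy update: observations overwrite the class presence probabilities.
    return {**belief, **observations}

belief, taken = run_episode(None, actions, value_fn, update_belief,
                            prior={"dog": 0.5, "car": 0.5}, deadline=7.0)
```

In this toy run the scene-context action raises the belief that a dog is present, so the dog detector is run next and the car detector is skipped because it no longer fits before the deadline.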
Action selection: maximize expected value
policy:
action-value function:
assuming linear structure:
Reward definition: derived from the AP vs. Time metric
Learning the policy parameters
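The equations for the three labeled quantities did not survive on these slides. One standard way to write them is given below as an assumption, not as the authors' exact notation: s is the belief state, a an action, s' the next state, R the reward, γ the discount, φ(s, a) the feature vector, and θ the policy weights. All of these symbols are introduced here for illustration.

```latex
\begin{align}
\pi(s) &= \arg\max_{a} Q(s, a) \\
Q^{\pi}(s, a) &= \mathbb{E}\left[ R(s, a) + \gamma \, Q^{\pi}\bigl(s', \pi(s')\bigr) \right] \\
Q^{\pi}(s, a) &\approx \theta^{\top} \phi(s, a)
\end{align}
```

Under the linear assumption, learning the policy reduces to estimating the weight vector θ from sampled episodes.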
Learning the policy
• Sample the expectation using Q-Iteration:
• collect (state,action,reward) tuples by running episodes
• solve for weights with L1 regression
• Cross-validate the discount
• When the discount is 0, the learned policy is greedy
• When the discount is 1, the policy looks ahead to the end of the episode.
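The effect of the discount can be seen in a simplified sketch of the fitting step. It assumes episodes have already been collected as per-step feature vectors and rewards, uses discounted returns from those episodes as regression targets, and swaps in ordinary least squares (`numpy.linalg.lstsq`) for the L1 regression mentioned above. It runs a single fitting pass, not a full Q-iteration loop, and all names and numbers are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Target for each step: its reward plus the discounted rewards that follow."""
    out, running = np.zeros(len(rewards)), 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        out[i] = running
    return out

def fit_weights(episodes, gamma):
    """episodes: list of (features, rewards), features of shape (steps, d).
    Returns least-squares weights mapping features to discounted returns."""
    X = np.vstack([f for f, _ in episodes])
    y = np.concatenate([discounted_returns(r, gamma) for _, r in episodes])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

rewards = [1.0, 2.0, 4.0]                                   # one hypothetical episode
greedy_targets = discounted_returns(rewards, gamma=0.0)     # immediate reward only
lookahead_targets = discounted_returns(rewards, gamma=1.0)  # sum to end of episode
theta = fit_weights([(np.eye(3), rewards)], gamma=1.0)
```

With a discount of 0 the targets are just the immediate rewards, so the fitted value function ranks actions greedily; with a discount of 1 each target includes everything earned up to the end of the episode.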
Feature Representation & Learned Policy Weights
Features:
- Prior probability of detector class presence
- Class presence probabilities & entropies given observations
- Time features (time to deadline, etc.)
[Figure: learned weight matrices (actions × features) for the greedy policy and the “RL” policy]
Policy Trajectories
Evaluation Results
PASCAL VOC 2007 DPM detectors Mean Detection AP
Today:
• Pose Pooling Kernels for Sub-category Recognition
• Generalization in Large-Scale Concept Hierarchies
• Timely Recognition
• Pose matters! – Distinctive fine-grained features should be relative to object configuration…
• Size matters! – Degree of generalization varies across category hierarchies and enables learning from few training examples…
• Plan ahead! – Learn what to look for, and when, in a large-scale setting to maximize time-constrained value...
TOWARDS LARGE-SCALE FINE-GRAINED RECOGNITION
FOR MORE INFORMATION…
• Ning Zhang, Ryan Farrell, Trevor Darrell, “Pose Pooling Kernels for Sub-category Recognition”, CVPR 2012
• Yangqing Jia, Joshua Abbott, Joe Austerweil, Tom Griffiths, Trevor Darrell, “Bayesian Concept Generalization”, UCB EECS TR
• Sergey Karayev, Tobi Baumgartner, Mario Fritz, Trevor Darrell, “Timely Object Recognition”, NIPS 2012