Towards Large-Scale Fine-Grained Category Learning: Pose Normalization, Hierarchical Generalization, and ... · Ltrevor/public_html/... · 2012-09-28

TOWARDS LARGE-SCALE FINE-GRAINED CATEGORY LEARNING

TREVOR DARRELL UC BERKELEY EECS & ICSI

TOWARDS LARGE-SCALE FINE-GRAINED CATEGORY LEARNING:

POSE NORMALIZATION, HIERARCHICAL GENERALIZATION, AND TIMELY DETECTION





RYAN FARRELL, NING ZHANG, YANGQING JIA, JOSHUA ABBOTT, JOSEPH AUSTERWEIL, TOM GRIFFITHS, SERGEY KARAYEV, TOBI BAUMGARTNER, MARIO FRITZ

CATEGORIZATION SPECTRUM

• Caltech 101 (2004): Fei-Fei, Fergus, Perona
• Caltech 256 (2007): Griffin, Holub, Perona
• Animals with Attributes (2009): Lampert, Nickisch, Harmeling, Weidmann
• Exploration of Methods for Automatic Whale Identification (2001): Heusel, Gunn
• Identifying Individual Salamanders (2007): Gamble, Ravela, McGarigal
• Labeled Faces in the Wild (2007): Huang, Ramesh, Berg, Learned-Miller
• STONEFLY9 (2009): Martínez-Muñoz, Zhang, Payet, Todorovic, Larios, Yamamuro, Lytle, Moldenke, Mortensen, Paasch, Shapiro, Dietterich
• Caltech/UCSD Birds 200 (2010): Welinder, Branson, Mita, Wah, Schroff, Belongie, Perona
• Visual Identification of Plant Species (2008): Belhumeur, Chen, Feiner, Jacobs, Kress, Ling, Lopez, Ramamoorthi, Sheorey, White, Zhang
• Oxford Flowers (2006-09): Nilsback, Zisserman

Fine-grained recognition bridges traditional instance-level and category-level tasks...

TOWARDS LARGE-SCALE FINE-GRAINED RECOGNITION

Tasks are relatively difficult even for humans: bird subspecies, vehicle brands, etc.

People learn them from small numbers of positive examples

Both local feature geometry and statistical appearance are salient.

Key differences vs. traditional recognition:
– Distinctive fine-grained features are often relative to object configuration
– Degree of generalization varies across category hierarchies
– Finest-grained categories may have few training examples
– Can’t detect everything all the time in a large-scale setting


Today:
• Pose Pooling Kernels for Sub-category Recognition
• Generalization in Large-Scale Concept Hierarchies
• Timely Recognition

POSE POOLING KERNELS FOR SUB-CATEGORY RECOGNITION

NING ZHANG, RYAN FARRELL, TREVOR DARRELL CVPR 2012

FINE-GRAINED VISUAL CATEGORIZATION

DISCRIMINATIVE DETAILS MAY BE HIGHLY LOCALIZED AND POSE RELATIVE

Scarlet Tanager Photo by Paul O’Toole

Summer Tanager Photo by Liam Wolff

Summer Tanager Photo by Patti Shoupe

Snowy Egret Photo by Shelley Rutkin

Great Egret Photo by James Hawkins

“BIRDLETS” [ICCV11] APPROACH

“Blue-Headed Vireo”!

DETECTION OF VOLUMETRIC PRIMITIVES

POSE-NORMALIZED APPEARANCE SPACE

2D VS. 3D

3D Representation
Pros
• Most accurate representation for volumetric objects
• Facilitates pose-normalized appearance model
Cons
• Accurate detection is challenging
• Annotations are costly
• Volumetric part model doesn’t capture flat parts such as wings

2D Representation
Pros
• Far easier to train (less complex and less costly annotations)
• Far more tolerant of localization errors
Cons
• Less precise part localization can attenuate key discriminative features

A classic story: recognition via detection

faces → detection/part localization → alignment/normalization → identification/attribute learning

Traditional Spatial Pooling

“POSE DEPENDENT” POOLING

OVERVIEW

“Red-Eyed Vireo”

Pose-Normalized Representation

Annotated images

Poselet training

Poselet activations

SVM classification

POSELET TRAINING

Parts: back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, throat

Collect positive examples from other training images

POSELET TRAINING

Bourdev and Malik, in ICCV 2009.

POSELET ACTIVATION (PREDICTION)

Poselet #90

Poselet #66

Poselet #65

COMPARING POSELETS

• Local image features from poselets which overlap in 3D are pooled together

POSE NORMALIZED REPRESENTATION

• Cluster based on warping distance or keypoint distance
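
The pooling idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each poselet activation in an image already carries a cluster id and a bag-of-words histogram, and the names `pose_pool`, `num_clusters`, and `vocab_size` are hypothetical.

```python
def pose_pool(activations, num_clusters, vocab_size):
    """Sum-pool histograms per poselet cluster, then concatenate.

    activations: list of (cluster_id, histogram) pairs, one per poselet
    activation in an image (hypothetical input format).
    """
    pooled = [[0.0] * vocab_size for _ in range(num_clusters)]
    for cluster_id, hist in activations:
        for k, count in enumerate(hist):
            pooled[cluster_id][k] += count
    # Concatenate the per-cluster histograms into one image descriptor.
    return [v for cluster_hist in pooled for v in cluster_hist]


# Two activations fall in cluster 0, one in cluster 2; cluster 1 is empty.
acts = [(0, [1, 0, 2]), (2, [0, 3, 0]), (0, [0, 1, 1])]
descriptor = pose_pool(acts, num_clusters=3, vocab_size=3)
print(descriptor)
```

Features from activations in the same cluster share histogram bins, so two images are compared part against part rather than image region against image region.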

POSELET CLUSTERS

Pose pooling vs. spatial pyramid

Pose pooling

EXPERIMENTAL RESULTS

P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona. CalTech/UCSD, 2010. http://www.vision.caltech.edu/visipedia/CUB-200.html


EXPERIMENTAL RESULTS


• Implementation Details
– Poselet Activations
  • Activations are generated by comparing each poselet’s keypoint distributions with the annotated keypoint locations on each image.
– Appearance Descriptor
  • Bag of words on SIFT features / Spatial Pyramid Match on top of SIFT.
– SVM classifier
  • Linear SVM / fast implementation of χ² and intersection kernels.
– Baseline methods
  • VLFeat toolbox: SIFT on the bounding-box object without pose-normalized information.
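
For reference, the two histogram kernels named above can be written directly. This is a plain, unoptimized sketch rather than the fast implementation the slide refers to, and it assumes the additive form of the χ² kernel, since the slide does not say which variant was used.

```python
def intersection_kernel(x, y):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def chi2_kernel(x, y, eps=1e-12):
    """Additive chi-squared kernel: sum of 2*x_i*y_i / (x_i + y_i).

    Bins where both histograms are (near) zero contribute nothing.
    """
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > eps)


# Two L1-normalized toy histograms (made-up values).
x = [0.5, 0.5, 0.0]
y = [0.25, 0.25, 0.5]
k_int = intersection_kernel(x, y)
k_chi2 = chi2_kernel(x, y)
```

Both kernels return 1 when an L1-normalized histogram is compared with itself, which makes their values easy to compare across descriptors.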

CATEGORIZATION RESULTS


Baseline - VLFeat 29.73% (MAP)

Warped Feature 36.33% (MAP)

Pose Pooling Kernel 40.60% (MAP)

Confusion matrices on 14 categories across two bird families {vireos, woodpeckers}.

MORE RESULTS


FUTURE WORK

• Detection
• Better descriptors
– Physically inspired color representations
– Salience cues to reduce false positive poselet responses
• Experiments on larger datasets…

Caltech UCSD Berkeley

For more information contact: Ryan Farrell

“Sweet” Taster-25 (released June 2012)

“Bitter” Sparrows-33 (will be released this Fall with the full 700+ category dataset)

Bounding Box

Segmentation Mask

Part Keypoints

Part Regions

Attributes

Among the Key Innovations
• Photos collected via enthusiast photographer submissions and annotated by citizen scientists
• Expert-curated collection (category for each image is verified by a domain expert)
• Deep taxonomy and tree-based distance measures

The dataset is being collected with tools designed for re-use in other fine-grained domains

D( , ) << D( , )






BAYESIAN CONCEPT GENERALIZATION

Yangqing Jia, Joshua Abbott, Joe Austerweil,

Tom Griffiths, Trevor Darrell

TOWARDS LARGE-SCALE FINE-GRAINED CATEGORY RECOGNITION

Current ML algorithms:
– hundreds of positive examples
– thousands of negative examples
– same learning method throughout hierarchy

Humans:
– can learn from one, two, or three examples
– often no negative examples
– more generalization for broader concepts (“size principle”)

LEARNING A NOUN

“blicket”

BAYESIAN INFERENCE

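$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h'} p(d \mid h')\, p(h')}$$

Here p(h|d) is the posterior probability, p(d|h) the likelihood, p(h) the prior probability, and the denominator is the sum over the space of hypotheses.

h: hypothesis

d: data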

LEARNING NOUNS

• Data
– object-word pairs

• Hypotheses
– sets of objects

• Likelihood
– weak sampling
– strong sampling

(Tenenbaum, 1999; Xu & Tenenbaum, 2007)

[Figure: graphical models relating hypothesis h, object x, and word w under weak and strong sampling]

Likelihood of examples labeled “blicket” under different hypotheses:
• p(d|h) = 0
• p(d|h) = 1/3
• p(d|h) = (1/3)^3
• p(d|h) = 1/12
• p(d|h) = (1/12)^3
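
The likelihood values above match strong sampling, in which each of n examples is drawn uniformly from the hypothesis. That gives p(d|h) = (1/|h|)^n when every example lies in h, and 0 otherwise. A minimal sketch follows, assuming that 1/3 and 1/12 correspond to hypotheses of 3 and 12 objects. The object names are made up for illustration.

```python
def strong_sampling_likelihood(examples, hypothesis):
    """p(d|h) under strong sampling: (1/|h|)^n, or 0 if any example is outside h."""
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)


# Hypothetical nested hypotheses: a narrow one (3 objects) and a broad one (12).
narrow = {"obj%d" % i for i in range(3)}
broad = {"obj%d" % i for i in range(12)}

one = ["obj0"]
three = ["obj0", "obj1", "obj2"]

lik_narrow_1 = strong_sampling_likelihood(one, narrow)      # 1/3
lik_narrow_3 = strong_sampling_likelihood(three, narrow)    # (1/3)^3
lik_broad_1 = strong_sampling_likelihood(one, broad)        # 1/12
lik_broad_3 = strong_sampling_likelihood(three, broad)      # (1/12)^3
lik_outside = strong_sampling_likelihood(["obj7"], narrow)  # 0
```

With one example the narrow hypothesis is favored 4 to 1. With three consistent examples the ratio grows to 64 to 1, which is the size principle at work.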

PREVIOUS WORK (XU & TENENBAUM, 2007)

Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias

Hypotheses

Data

PREVIOUS RESULTS (SMALL-SCALE)

Challenges

• Small, hand-constructed domains

• Toy stimuli

• Constructing hypothesis space based on pairwise similarity judgments requires O(n^2) judgments

LARGE-SCALE WORD LEARNING

Challenges

• Small, hand-constructed domains

• Toy stimuli

• Constructing hypothesis space based on pairwise similarity judgments requires O(n^2) judgments

Solutions

• Hypothesis space is automatically derived from WordNet structure



Here are three BLICKETS

Here are five FEPS

Here are four ZIVS

Is this a BLICKET?

Is this a ZIV?

Is this a FEP?

Here are three FEPS

Can you help Mr. Frog find the other “FEPS”?

LARGE SCALE MODEL RESULTS

[Abbott, Austerweil, and Griffiths, Constructing a hypothesis space from the web for large scale Bayesian word learning, COGSCI 2012]

GENERALIZING TO NEW IMAGES

• Bayesian word learning offers advantages over standard machine learning approaches.

• The first challenge, scaling, has been solved (going from 45 to 14 million objects), but only using ImageNet images and their location in the hierarchy.

• A recent extension of the model incorporates direct pixel observations and models noisy recognition; it generalizes from arbitrary objects and better predicts human data…

Task

• renuzit air fresher
• coke-can
• pringles
• pasta-box
• book-robotics
• tea-box

Leaf-node classes from cropped PR2 camera images…

Image Classification Pipeline

• A convolutional neural network (CNN) pipeline is well suited to this type of data ([Jia et al CVPR12])

local feature coding

spatial pooling

SVM classification

“pasta box”

Densely-coded Local Features

• We extract overlapping local image patches

Encoding Local Features

local features

dictionary (learned in an unsupervised fashion)

Encoded activation maps

Spatial Pooling

Take max on a regular grid (or learned spatial bins [cvpr12])

Convert to a feature vector
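
The pooling and vectorization steps can be sketched as follows. This is a simplified illustration, not the [Jia et al CVPR12] code: it assumes the encoded activation maps are stored as an H x W x K nested list, with K the dictionary size, and covers only the regular-grid case. The function name `max_pool_grid` is hypothetical.

```python
def max_pool_grid(act_map, grid):
    """Max-pool an H x W x K activation map over a grid x grid layout.

    Returns a flat feature vector of length grid * grid * K.
    Assumes grid <= H and grid <= W so that every cell is non-empty.
    """
    height, width, depth = len(act_map), len(act_map[0]), len(act_map[0][0])
    feature = []
    for gi in range(grid):
        r0, r1 = gi * height // grid, (gi + 1) * height // grid
        for gj in range(grid):
            c0, c1 = gj * width // grid, (gj + 1) * width // grid
            for k in range(depth):
                # Strongest response of dictionary atom k inside this cell.
                feature.append(max(act_map[r][c][k]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return feature


# Toy 4 x 4 map with a single channel whose value is 4 * row + col.
toy_map = [[[4 * r + c] for c in range(4)] for r in range(4)]
pooled = max_pool_grid(toy_map, grid=2)
print(pooled)
```

The resulting vector is what the SVM stage consumes. Learned spatial bins would replace the fixed cell boundaries computed here.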

The whole pipeline

“pasta box”

Learning from Examples

• “Dear Robot, here are some feps”

• “Now get me more feps / are these also feps?”

(pasta-box) (peanut-butter) (spam)

Learning from Examples

• We assume a hierarchy of hypotheses from WordNet or provided in a robot’s environment.

• Each hypothesis defines a subset of the leaf nodes (instances) that belong to it:

These are feps

This is probably not a fep

These are not feps
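
A minimal sketch of this setup: every node in a hierarchy is a hypothesis, namely the set of leaf nodes beneath it, and the probability that a query object is also a “fep” is the posterior-weighted vote of the hypotheses consistent with the examples. The toy hierarchy, the intermediate node `food`, the uniform prior over nodes, and the strong-sampling likelihood are all assumptions made for illustration.

```python
def leaves_under(tree, node):
    """Set of leaf nodes below `node`; a leaf is its own singleton hypothesis."""
    children = tree.get(node, [])
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= leaves_under(tree, child)
    return leaves


def generalization_prob(tree, examples, query):
    """P(query belongs to the concept | examples), uniform prior over nodes."""
    nodes = set(tree) | {c for children in tree.values() for c in children}
    numerator = denominator = 0.0
    for node in nodes:
        hypothesis = leaves_under(tree, node)
        if all(x in hypothesis for x in examples):
            # Strong sampling: smaller consistent hypotheses get more weight.
            weight = (1.0 / len(hypothesis)) ** len(examples)
            denominator += weight
            if query in hypothesis:
                numerator += weight
    return numerator / denominator


# Hypothetical hierarchy over leaf classes named in the slides.
toy_tree = {
    "object": ["food", "book-robotics"],
    "food": ["pasta-box", "peanut-butter", "spam"],
}
feps = ["pasta-box", "peanut-butter"]
p_spam = generalization_prob(toy_tree, feps, "spam")
p_book = generalization_prob(toy_tree, feps, "book-robotics")
```

Given two food examples, `spam` is accepted with probability 1 because it lies in every consistent hypothesis. `book-robotics` is covered only by the broad root hypothesis, so its probability is lower.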

Visually Grounded Word Learning at a Large Scale

• We tested this model at a large scale with ImageNet

[Jia, Abbott, Austerweil, Griffiths, Darrell. 2012]

Human Behavior

More specific concepts More general concepts

More specific queries More general queries

Result Comparisons

Human Behavior

Our Model

Classical Concept Learning (without vision)

Classical Vision (without concept learning)

Video





Timely Object Recognition

Sergey Karayev, Tobi Baumgartner, Mario Fritz, Trevor Darrell NIPS 2012

Timely Object Recognition

Lots of classes and images in a large scale setting…

Potentially different class values…

Not enough time to run all object detectors…

Our Solution: Dynamic policy for selecting detectors

Recent Related Work

Vijayanarasimhan and Kapoor. Visual Recognition and Detection Under Bounded Computational Resources. CVPR 2010.

Gao and Koller. Active Classification based on Value of Classifier. NIPS 2011.

New Metric: Performance (AP) vs. Time

Area under the AP vs. T curve between start and deadline times
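
One way to compute such a metric is sketched below. It assumes AP is measured at a few time points, interpolated linearly between them, and that the area is divided by the window length so the score stays on the AP scale. The interpolation and normalization choices, and the function names, are assumptions rather than details from the slides.

```python
def interp(times, values, t):
    """Piecewise-linear interpolation; clamps outside the measured range.

    Assumes `times` is strictly increasing.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(1, len(times)):
        if t <= times[i]:
            frac = (t - times[i - 1]) / (times[i] - times[i - 1])
            return values[i - 1] + frac * (values[i] - values[i - 1])


def ap_vs_time_area(times, aps, t_start, t_deadline):
    """Normalized trapezoidal area under AP(t) between t_start and t_deadline."""
    knots = [t_start] + [t for t in times if t_start < t < t_deadline] + [t_deadline]
    vals = [interp(times, aps, t) for t in knots]
    area = sum(0.5 * (vals[i] + vals[i - 1]) * (knots[i] - knots[i - 1])
               for i in range(1, len(knots)))
    return area / (t_deadline - t_start)


# Toy curve: AP rises from 0.0 to 0.4 over 20 seconds (made-up numbers).
toy_times = [0.0, 10.0, 20.0]
toy_aps = [0.0, 0.2, 0.4]
score_full = ap_vs_time_area(toy_times, toy_aps, 0.0, 20.0)
score_window = ap_vs_time_area(toy_times, toy_aps, 5.0, 15.0)
```

A policy that reaches a high AP early scores better under this measure than one that reaches the same final AP only at the deadline.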

[Diagram: sequential loop over time. Image → Belief State → Action → Observations → Belief State → Action → Observations → etc.]

Sequential Detection

Action selection: maximize expected value


Actions

- scene context or object detector actions
- run on the whole image
- generate observations:
  - list of detections
  - GIST feature, etc.

Execute action (“black box”)

Receive observations



belief state update with observations, leverage context




Class presence probabilities inferred from observations

Belief state update












policy:

action-value function:

assuming linear structure:

reward definition

Action selection maximize expected value

Reward definition: derived from the AP vs. Time metric




Learning the policy

• Sample the expectation using Q-Iteration:

• collect (state,action,reward) tuples by running episodes

• solve for weights with L1 regression


• Cross-validate the discount γ

• When γ is 0, learning a greedy policy

• When γ is 1, looking ahead to the end of the episode.
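
The role of the discount can be illustrated with a small sketch. It assumes the regression targets are discounted sums of the rewards observed to the end of an episode, and that the action-value function is linear in a feature vector. The names `theta`, `phi`, and the toy actions are hypothetical, and the regression step that solves for the weights is omitted.

```python
def discounted_targets(rewards, gamma):
    """Target value at each step: r_i + gamma * r_{i+1} + gamma^2 * r_{i+2} + ..."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        targets[i] = running
    return targets


def q_value(theta, phi):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(theta, phi))


def select_action(theta, features_by_action):
    """Pick the untaken action with the highest estimated value."""
    return max(features_by_action,
               key=lambda a: q_value(theta, features_by_action[a]))


episode_rewards = [1.0, 0.5, 2.0]                        # made-up rewards
greedy = discounted_targets(episode_rewards, gamma=0.0)  # immediate reward only
lookahead = discounted_targets(episode_rewards, gamma=1.0)  # sum to episode end

theta = [1.0, -0.5]                                      # hypothetical weights
candidates = {"gist": [0.2, 0.1], "dog_detector": [0.6, 0.4]}
chosen = select_action(theta, candidates)
```

With a discount of 0 each target equals the immediate reward, so the learned policy is greedy. With a discount of 1 each target is the total reward remaining in the episode.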

Feature Representation & Learned Policy Weights

[Figure: learned policy weights (actions × features) for the greedy policy and the “RL” policy]

Features:
- Prior probability of detector class presence
- Class presence probabilities & entropies given observations
- Time features (time to deadline, etc.)

Policy Trajectories


Evaluation Results

PASCAL VOC 2007 DPM detectors Mean Detection AP






• Pose matters! – Distinctive fine-grained features should be relative to object configuration…

• Size matters! – Degree of generalization varies across category hierarchies and enables learning from few training examples…

• Plan ahead! – Learn what to look for, and when, in a large-scale setting to maximize time-constrained value...


FOR MORE INFORMATION…

• Ning Zhang, Ryan Farrell, Trevor Darrell, “Pose Pooling Kernels for Sub-category Recognition”, CVPR 2012

• Yangqing Jia, Joshua Abbott, Joe Austerweil, Tom Griffiths, Trevor Darrell, “Bayesian Concept Generalization”, UCB EECS TR

• Sergey Karayev, Tobi Baumgartner, Mario Fritz, Trevor Darrell, “Timely Object Recognition”, NIPS 2012




RYAN FARRELL, NING ZHANG, YANGQING JIA, JOSHUA ABBOTT, JOSEPH AUSTERWEIL, TOM GRIFFITHS, SERGEY KARAYEV, TOBI BAUMGARTNER, MARIO FRITZ
