READING BETWEEN THE LINES: OBJECT LOCALIZATION USING IMPLICIT CUES FROM IMAGE TAGS
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin
CVPR 2010
Hwang & Grauman, CVPR 2010
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Example tag list: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore, #24
Detecting tagged objects
Previous work using tagged images focuses on the noun ↔ object correspondence.
[Duygulu et al. 2002, Berg et al. 2004, Fergus et al. 2005, Li et al. 2009; see also Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …]
Tags (image 1): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Tags (image 2): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
Based on tags alone, can you guess where and what size the mug will be in each image?
Our Idea
The list of human-provided tags gives useful cues beyond just which objects are present.
Tags (image 1): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it → absence of larger objects; the mug is named first.
Tags (image 2): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster → presence of larger objects; the mug is named later.
Our Idea
We propose to learn the implicit localization cues provided by tag lists to improve object detection.
Approach overview
Training: learn an object-specific connection between localization parameters and implicit tag features, using tagged training images, e.g. [Woman, Table, Mug, Ladder], [Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it], [Computer, Poster, Desk, Screen, Mug, Poster], [Mug, Eiffel], [Desk, Mug, Office], [Mug, Coffee].
Testing: given a novel image, localize objects based on both tags and appearance, combining the object detector with P(location, scale | tags) from the implicit tag features.
Feature: Word presence/absence
Presence or absence of other objects affects the scene layout → record bag-of-words frequency.
W = [w1, …, wN], where wi = count of the i-th word.

Image 1 (small objects mentioned): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 (large objects mentioned): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

          Mug  Pen  Post-it  Toothbrush  Key  Photo  Computer  Screen  Keyboard  Desk  Bookshelf  Poster
W(im1)     1    1     1          1        1     1       0         0        1       0       0         0
W(im2)     1    0     0          0        0     0       1         2        1       1       1         1
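The bag-of-words feature above can be sketched in a few lines; this is a minimal illustration in which the vocabulary and its ordering follow the example table and are otherwise arbitrary:

```python
from collections import Counter

# Vocabulary ordered as in the example table (illustrative choice).
VOCAB = ["Mug", "Pen", "Post-it", "Toothbrush", "Key", "Photo",
         "Computer", "Screen", "Keyboard", "Desk", "Bookshelf", "Poster"]

def word_feature(tags, vocab=VOCAB):
    """W = [w_1, ..., w_N], where w_i = count of the i-th vocabulary word."""
    counts = Counter(tags)
    return [counts[w] for w in vocab]

im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
print(word_feature(im1))  # → [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0], matching W(im1)
```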
Feature: Rank of tags
People tag the "important" objects earlier → record the rank of each tag compared to its typical rank.
R = [r1, …, rN], where ri = percentile rank of the i-th word.

Image 1: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (Mug has a relatively high rank)
Image 2: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

          Mug   Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R(im1)    0.80    0         0       0.51     0       0         0     0.28   0.72   0.82       0        0.90
R(im2)    0.23    0.62      0.21    0.13     0.48    0.61      0.41  0      0      0          0        0
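One plausible implementation of the percentile-rank feature is sketched below. The `train_ranks` structure, mapping each word to the ranks it took across training tag lists, is a hypothetical helper, and the paper's exact definition of "typical rank" may differ:

```python
def percentile_rank_feature(tags, vocab, train_ranks):
    """R = [r_1, ..., r_N]: r_i compares the tag's rank in this image against
    the ranks the same tag typically takes in training lists; absent tags get 0."""
    feat = []
    for w in vocab:
        if w not in tags or not train_ranks.get(w):
            feat.append(0.0)
            continue
        r = tags.index(w) + 1  # 1-based rank of w in this tag list
        hist = train_ranks[w]
        # fraction of training occurrences where w was tagged no earlier than here
        feat.append(sum(1 for h in hist if h >= r) / len(hist))
    return feat

# Hypothetical training statistics: "Mug" was tagged at ranks 1, 2, 5, 8.
train_ranks = {"Mug": [1, 2, 5, 8]}
print(percentile_rank_feature(["Mug", "Key"], ["Mug"], train_ranks))  # → [1.0]
```

Under this reading, a tag named unusually early gets a value near 1, echoing the "relatively high rank" callout for Mug in image 1.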
Feature: Proximity of tags
People tend to move their eyes to nearby objects after the first fixation → record the proximity of all tag pairs.
P = [p11, p12, …, pNN], where pij = inverse rank difference between tags i and j (0 if either is absent).

Image 1: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it
Image 2: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster

P(im1):     Mug  Screen  Keyboard  Desk  Bookshelf
Mug          1     0       0.5      0       0
Screen             0        0       0       0
Keyboard                    1       0       0
Desk                                0       0
Bookshelf                                   0

P(im2):     Mug  Screen  Keyboard  Desk  Bookshelf
Mug          1     1       0.5     0.2    0.25
Screen             1        1      0.33   0.5
Keyboard                    1      0.33   0.5
Desk                                1      1
Bookshelf                                  1

In image 2, Mug and Screen may be close to each other.
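A sketch of the pairwise proximity feature, assuming proximity = inverse absolute rank difference (1 on the diagonal for present tags, 0 when either tag is absent). This reproduces the image-1 matrix above, though the paper's exact handling of repeated tags may differ:

```python
def proximity_feature(tags, vocab):
    """P[i][j] = 1 / |rank_i - rank_j| for co-occurring tags (1 when i == j),
    0 when either tag is absent. Ranks are 1-based positions in the tag list."""
    ranks = {}
    for pos, t in enumerate(tags, start=1):
        ranks.setdefault(t, pos)  # keep the first occurrence's rank
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if vocab[i] in ranks and vocab[j] in ranks:
                d = abs(ranks[vocab[i]] - ranks[vocab[j]])
                P[i][j] = 1.0 if d == 0 else 1.0 / d
    return P

vocab = ["Mug", "Screen", "Keyboard", "Desk", "Bookshelf"]
im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
P1 = proximity_feature(im1, vocab)
print(P1[0][2])  # → 0.5: Mug (rank 1) vs. Keyboard (rank 3)
```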
Approach overview (recap)
Training: learn the implicit tag features W, R, P alongside the object detector from tagged training images.
Testing: localize using both the detector and P(location, scale | W, R, P).
Modeling P(X|T)
We need a PDF for the location and scale of the target object, given the tag feature:
P(X = (scale, x, y) | T = tag feature)
We model it directly using a mixture density network (MDN) [Bishop, 1994]: a neural network maps the input tag feature (Words, Rank, or Proximity) to the parameters (α, µ, Σ) of a Gaussian mixture model.
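A minimal numpy sketch of the MDN forward pass. The weights here are randomly initialized and the sizes illustrative (in practice the network is trained to maximize the likelihood of the training localization parameters), and diagonal covariances are an assumption made for compactness:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, H, K, D_X = 12, 16, 3, 3  # tag-feature dim, hidden units, components, dim of X=(scale, x, y)

# Illustrative random weights; learned in the real system.
W1 = rng.normal(scale=0.1, size=(H, D_IN)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(K * (2 * D_X + 1), H)); b2 = np.zeros(K * (2 * D_X + 1))

def mdn_forward(t):
    """Map a tag feature t to Gaussian-mixture parameters (alpha, mu, sigma)."""
    h = np.tanh(W1 @ t + b1)
    z = W2 @ h + b2
    z_a, z_m, z_s = z[:K], z[K:K + K * D_X], z[K + K * D_X:]
    alpha = np.exp(z_a - z_a.max()); alpha /= alpha.sum()  # softmax mixing weights
    mu = z_m.reshape(K, D_X)                               # component means
    sigma = np.exp(z_s).reshape(K, D_X)                    # per-dimension std devs
    return alpha, mu, sigma

def density(x, alpha, mu, sigma):
    """P(X = x | T): diagonal-Gaussian mixture density."""
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(alpha @ comp.prod(axis=1))

alpha, mu, sigma = mdn_forward(np.ones(D_IN))  # e.g. a word-count feature
```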
Modeling P(X|T): Example
Top 30 most likely localization parameters sampled for the object "car", given only the tags.
Example tag lists: [Lamp]; [Car, Wheel, Wheel, Light]; [Window, House, House]; [Car, Car, Road, House, Lightpole]; [Car, Windows, Building, Man, Barrel, Car, Truck, Car]; [Boulder]; [Car]
Integrating with the object detector
How to exploit the learned distribution P(X|T)?
1) Use it to speed up the detection process (location priming):
(a) Sort all candidate windows according to P(X|T), from most likely to least likely.
(b) Run the detector only at the most probable locations and scales.
2) Use it to increase detection accuracy by modulating the detector output scores: e.g., detector scores of 0.7, 0.8, 0.9 on three windows, combined with tag-based predictions of 0.9, 0.3, 0.2, yield modulated scores of 0.63, 0.24, 0.18.
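Both uses can be sketched as follows; the scores are the slide's illustrative numbers, and the simple product used for modulation is an assumption (the paper's exact combination rule may differ):

```python
def prime_windows(windows, tag_prior):
    """Location priming: visit candidate windows from most to least likely
    under P(X|T), so the detector can stop after the top-ranked ones."""
    return sorted(windows, key=tag_prior, reverse=True)

def modulate(detector_scores, tag_scores):
    """Accuracy: reweight each detector confidence by the tag-based prediction."""
    return [round(d * t, 2) for d, t in zip(detector_scores, tag_scores)]

# The slide's three windows: the detector says (0.7, 0.8, 0.9),
# the tag features say (0.9, 0.3, 0.2).
print(modulate([0.7, 0.8, 0.9], [0.9, 0.3, 0.2]))  # → [0.63, 0.24, 0.18]
```

Note how the window the raw detector ranked last (0.7) becomes the top hypothesis once the tag prior agrees with it.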
Experiments: Datasets

LabelMe:
- Street and office scenes
- Contains ordered tag lists via the labels added
- 5 classes, 56 unique taggers, 23 tags/image
- Dalal & Triggs' HOG detector

PASCAL VOC 2007:
- Flickr images
- Tag lists obtained on Mechanical Turk
- 20 classes, 758 unique taggers, 5.5 tags/image
- Felzenszwalb et al.'s LSVM detector
Experiments
We evaluate: detection speed and detection accuracy.
We compare: the raw detector (HOG, LSVM) vs. the raw detector + our tag features.
We also show results using Gist [Torralba 2003] as context, for reference.
PASCAL: Performance evaluation
Accuracy (precision vs. recall, all 20 PASCAL classes): LSVM (AP = 33.69); LSVM+Tags (AP = 36.79); LSVM+Gist (AP = 36.28). We know which detection hypotheses to trust most.
Speed (portion of windows searched vs. detection rate, all 20 LabelMe classes): Sliding (0.223); Sliding+Tags (0.098); Sliding+Gist (0.125). We search fewer windows to achieve the same detection rate: naïve sliding window search scans 70% of the windows, while we search only 30%.
PASCAL: Accuracy vs. Gist per class
[Bar chart: per-class AP improvement (roughly -2 to +3) over the raw detector for Tags vs. Gist, across pottedplant, cat, sofa, boat, motorbike, train, car, chair, tvmonitor, and horse.]
PASCAL: Example detections (LSVM+Tags (Ours) vs. LSVM alone)
Bottle: [Lamp, Person, Bottle, Dog, Sofa, Painting, Table]; [Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food]
Car: [Car, License Plate, Building]; [Car, Door, Door, Gear, Steering Wheel, Seat, Seat, Person, Person, Camera]
PASCAL: Example detections (LSVM+Tags (Ours) vs. LSVM alone)
Dog: [Dog, Floor, Hairclip]; [Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf]
Person: [Person, Microphone, Light]; [Horse, Person, Tree, House, Building, Ground, Hurdle, Fence]
PASCAL: Example failure cases (LSVM+Tags (Ours) vs. LSVM alone)
Tags: [Aeroplane, Sky, Building, Shadow]; [Person, Person, Pole, Building, Sidewalk, Grass, Road]; [Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall]; [Bottle, Glass, Wine, Table]
Results: Observations
Our implicit features often predict scale well for indoor objects and position well for outdoor objects.
Gist is usually better for y position, while our tags are generally stronger for scale: visual and tag context are complementary.
The features need to have been learned from target objects in a variety of examples with different contexts.
Summary
We want to learn what is implied, beyond which objects are present, by how a human provides tags for an image.
Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.
The novel tag cues enable an effective localization prior.
We obtain significant gains with state-of-the-art detectors on two datasets.
Future work
Joint multi-object detection
From tags to natural language sentences
Image retrieval applications