Transcript
Page 1: Inferring What’s Important in Image Search (grauman/slides/grauman-vabs-workshop...)

Inferring What’s Important in Image Search

Kristen Grauman

University of Texas at Austin

With Adriana Kovashka, Devi Parikh, and Sung Ju Hwang

Page 2

“Visual” search 1.0

• Associate images by keywords and meta-data

Page 3

Visual search 2.0

• Auto-annotate images with relevant keywords:

objects, attributes, scenes, visual concepts…

(Example annotations: cow, furry, black, outdoors)

[Kumar et al. 2008, Snoek et al. 2006, Naphade et al. 2006, Chang et al. 2006, Vaquero et al. 2009, Berg et al. 2010, and many others…]

Kristen Grauman, UT-Austin

Page 4

Problem

• Fine-grained visual differences beyond keyword composition influence image search relevance.

Similar object distributions, yet are they equally relevant?

Page 5

Problem

• Fine-grained visual differences beyond keyword composition influence image search relevance.

How to capture the target with a single description?

≠ brown strappy heels

Page 6

Goal

• Fine-grained visual differences beyond keyword composition influence image search relevance.

• Goal: Account for subtleties in visual relevance

– Implicit importance: infer which objects most define the scene

– Explicit importance: comparative feedback about which properties are (ir)relevant

Page 7

Related work

• Region-noun correspondence [Duygulu et al. 2002, Barnard et al. 2003, Berg et al. 2004, Gupta & Davis 2008, Li et al. 2009, Hwang & Grauman 2010, …]

• Dual-view image-text representations [Monay & Gatica-Perez 2003, Hardoon & Shawe-Taylor 2003, Quattoni et al. 2007, Bekkerman & Jeon 2007, Quack et al. 2008, Blaschko & Lampert 2008, Qi et al. 2009, …]

• Image description and memorability [Spain & Perona 2008, Farhadi et al. 2010, Berg et al. 2011, Parikh & Grauman 2011, Isola et al. 2011, Berg et al. 2012]

Page 8

Capturing relative importance

(Figure: a query image versus retrieved images)

• Object presence ≠ importance

Can we infer what human viewers find most important?

Page 9

Capturing relative importance

• Intuition: Human-provided tags give useful cues beyond just which objects are present.

Based on tags alone, what can you say about the mug in each image?

Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

Page 11

Our idea: Learning implicit importance

• Learn cross-modal representation that accounts for “what to mention” using implicit cues from text

Textual cues: frequency, relative order, mutual proximity

Visual cues: texture, scene, color, …

TAGS: Cow, Birds, Architecture, Water, Sky

Training: human-given descriptions

Page 12

Our idea: Learning implicit importance

• Learn cross-modal representation that accounts for “what to mention” using implicit cues from text

Textual cues: frequency, relative order, mutual proximity

Visual cues: texture, scene, color, …

TAGS: Cow, Birds, Architecture, Water, Sky

Training: human-given descriptions

Importance = how likely an object is named early on by a human describing an image.

Page 13

Implicit tag features

• Presence or absence of other objects affects the scene layout → record bag-of-words frequency.

• People tag the “important” objects earlier → record the rank of each tag compared to its typical rank.

• People tend to move eyes to nearby objects after first fixation → record the proximity of all tag pairs.

(Example tag lists, with tags numbered in the order given: “Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it” and “Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster”)
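The three cues above amount to simple feature computations over one image's ordered tag list. A minimal sketch, with function and variable names that are illustrative rather than from the paper:

```python
from itertools import combinations

def implicit_tag_features(tags, typical_rank, vocab):
    """Sketch of the three implicit tag cues for one image.

    tags: tags in the order a person gave them, e.g. ["Mug", "Key", ...]
    typical_rank: average rank of each word across the whole collection
    vocab: full tag vocabulary (fixes the feature dimensions)
    """
    idx = {w: i for i, w in enumerate(vocab)}

    # 1) Bag-of-words frequency: which objects were mentioned at all.
    freq = [0] * len(vocab)
    for t in tags:
        freq[idx[t]] += 1

    # 2) Relative rank: how much earlier than usual each tag appears
    #    (important objects tend to be named early).
    rank = [0.0] * len(vocab)
    for r, t in enumerate(tags, start=1):
        rank[idx[t]] = typical_rank[t] - r  # positive => earlier than typical

    # 3) Mutual proximity: distance in the tag list for every tag pair
    #    (eyes tend to move between nearby objects).
    prox = {}
    for (i, a), (j, b) in combinations(enumerate(tags), 2):
        prox[(a, b)] = abs(i - j)

    return freq, rank, prox
```

These per-image feature vectors form the textual view fed into the cross-modal learning step on the following slides.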

Page 14

Learning an importance-aware semantic space

(Diagram: view x and view y are projected into a common importance-aware semantic space; an untagged query image is mapped into the same space.)

[Hwang & Grauman, IJCV 2011]

Page 15

Learning an importance-aware semantic space

Select projection bases given paired data (Linear CCA), or given a pair of kernel functions (Kernel CCA): same objective, but projections in kernel space.

[Akaho 2001, Fyfe et al. 2001, Hardoon et al. 2004]
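The projection equations on this slide did not survive extraction; for reference, the standard linear CCA objective and its kernelized form (as in Hardoon et al. 2004, with regularization terms omitted) are:

```latex
% Linear CCA: given paired data {(x_i, y_i)}, choose projection bases
% (w_x, w_y) maximizing the correlation of the two projected views:
\rho = \max_{w_x,\,w_y}
  \frac{w_x^\top \Sigma_{xy}\, w_y}
       {\sqrt{w_x^\top \Sigma_{xx}\, w_x}\,\sqrt{w_y^\top \Sigma_{yy}\, w_y}}

% Kernel CCA: given kernels with Gram matrices K_x, K_y, the same
% objective with projections expressed in kernel space,
% w_x = \sum_i \alpha_i \phi_x(x_i), \quad w_y = \sum_i \beta_i \phi_y(y_i):
\rho = \max_{\alpha,\,\beta}
  \frac{\alpha^\top K_x K_y\, \beta}
       {\sqrt{\alpha^\top K_x^2\, \alpha}\,\sqrt{\beta^\top K_y^2\, \beta}}
```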

Page 16

Assumptions

1. People tend to agree about which objects most define a scene.

2. Significance of those objects in turn influences the order in which they are mentioned.

Evidence from previous studies that these hold: [von Ahn & Dabbish 2004, Tatler et al. 2005, Spain & Perona 2008, Einhauser et al. 2008, Elazary & Itti 2008, Berg et al. 2012]

Page 17

Image+text datasets

• PASCAL VOC 2007 with tags: ~10K images

• LabelMe images with tags: ~4K images

• PASCAL VOC 2007 with sentences: ~500 images

Text data collected on MTurk (~750 unique workers)

(Examples: tags and sentences)

Page 18

Results: Accounting for importance in image search

(Figure: a query image with top retrievals from our method, a Words+Visual baseline, and a Visual-only baseline)

[Hwang & Grauman, IJCV 2011]

Page 19

Results: Accounting for importance in image search

(Figure: a second example query with retrievals from our method, Words+Visual, and Visual-only)

[Hwang & Grauman, IJCV 2011]

Page 20

Results: Accounting for importance in image search

(Figure: a third example query with retrievals from our method, Words+Visual, and Visual-only)

[Hwang & Grauman, IJCV 2011]

Page 21

Results: Accounting for importance in image search

Our method better retrieves images that share the query’s important objects

[Hwang & Grauman, IJCV 2011]

Page 22

We can also predict descriptions for novel images

(Diagram: untagged query image → importance-aware semantic space → auto-tagging; example predicted tags include Cow, Tree, Grass, Field, Fence)

[Hwang & Grauman, IJCV 2011]

Page 23

Results: Accounting for importance in auto-tagging

We can also predict descriptions for novel images

Example predicted tag lists:

• Person, Tree, Car, Chair, Window

• Bottle, Knife, Napkin, Light, Fork

• Tree, Boat, Grass, Water, Person

• Boat, Person, Water, Sky, Rock

[Hwang & Grauman, IJCV 2011]

Page 24

What do human judges think?

Select those images below that contain the “most important” objects seen in the query.

Page 25

What do human judges think?

Subjects are 323 MTurk workers.

Require a unanimous vote among 5 for an image to be considered relevant.

Page 26

Goal

• Fine-grained visual differences beyond keyword composition influence image search relevance.

• Goal: Account for subtleties in visual relevance

– Implicit importance: infer which objects most define the scene

– Explicit importance: comparative feedback about which properties are (ir)relevant

Page 27

Problem with one-shot visual search

• Keywords (including attributes) can be insufficient to capture the target in one shot.

≠ brown strappy heels

Page 28

Interactive visual search

(Diagram: feedback and results loop)

• Iteratively refine the set of retrieved images based on user feedback on results so far

• Potential to communicate the desired visual content more precisely

Page 29

Limitations of traditional interactive methods

• Tuning system parameters is difficult for the user [Flickner et al. 1995, Ma & Manjunath 1997, Iqbal & Aggarwal 2002]

(Example: hand-set feature weights such as color 0.2, texture 0.2, shape 0.6, …)

Page 30

Limitations of traditional interactive methods

• Tuning system parameters is difficult for the user [Flickner et al. 1995, Ma & Manjunath 1997, Iqbal & Aggarwal 2002]

• Traditional binary feedback is imprecise [Rui et al. 1998, Zhou et al. 2003, …]

(Example: for the query “white high heels”, results can only be marked relevant or irrelevant)

Page 31

WhittleSearch: Relative attribute feedback

Whittle away irrelevant images via precise semantic feedback

Query: “white high-heeled shoes”

Initial top search results → feedback: “shinier than these”, “more formal than these” → refined top search results

Kovashka, Parikh, and Grauman, CVPR 2012

Page 32

WhittleSearch: Relative attribute feedback

Whittle away irrelevant images via precise semantic feedback

Initial reference images → feedback: “broader nose”, “similar hair style” → refined top search results

Kovashka, Parikh, and Grauman, CVPR 2012

Page 33

Visual attributes

• High-level semantic properties shared by objects

• Human-understandable and machine-detectable

(Examples: brown, indoors, outdoors, flat, four-legged, high heel, red, has-ornaments, metallic)

[Farhadi et al. 2009, Lampert et al. 2009, Kumar et al. 2009, Endres et al. 2010, Wang & Mori 2010, Berg et al. 2010, Branson et al. 2010, Parikh & Grauman 2011, …]

Page 34

Relative attributes

• Represent comparative relationships between classes, images, and their properties.

(Diagram: concepts and their properties; “brighter than” relates two images along the “bright” attribute)

[Parikh & Grauman, ICCV 2011]

Page 35

Learning relative attributes

• We want to learn a spectrum (ranking model) for an attribute, e.g. “brightness”.

• Supervision consists of ordered pairs and similar pairs.

Parikh and Grauman, ICCV 2011

Page 36

Learning relative attributes

Learn a ranking function (a function of the image features with learned parameters) that best satisfies the constraints.

Parikh and Grauman, ICCV 2011

Page 37

Learning relative attributes

Max-margin learning-to-rank formulation: each image is mapped to a relative attribute score via the learned wm, and the constraints enforce a rank margin between ordered pairs.

Joachims, KDD 2002; Parikh and Grauman, ICCV 2011
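The equations on this slide were lost in extraction; the standard formulation from the cited papers (Parikh & Grauman's rank-SVM, built on Joachims) is:

```latex
% Learn, for attribute m, a ranking function  r_m(x) = w_m^\top x
% satisfying ordered pairs O_m and similar pairs S_m:
%   \forall (i,j) \in O_m:\; w_m^\top x_i > w_m^\top x_j
%   \forall (i,j) \in S_m:\; w_m^\top x_i \approx w_m^\top x_j
% Max-margin formulation with slack variables:
\min_{w_m,\,\xi,\,\gamma}\;
  \tfrac{1}{2}\lVert w_m \rVert^2
  + C\Big(\textstyle\sum \xi_{ij}^2 + \sum \gamma_{ij}^2\Big)
\quad \text{s.t.}\quad
  w_m^\top (x_i - x_j) \ge 1 - \xi_{ij}
    \;\; \forall (i,j)\in O_m,
\qquad
  \lvert w_m^\top (x_i - x_j)\rvert \le \gamma_{ij}
    \;\; \forall (i,j)\in S_m,
\qquad \xi_{ij}\ge 0,\;\gamma_{ij}\ge 0.
```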

Page 38

Relating images

• Rank images according to attribute presence

(Example attribute spectra: bright, formal, natural)

Page 39

WhittleSearch with relative attribute feedback

Offline: we learn a spectrum for each attribute.

During search:

1. User selects some reference images and marks how they differ from the desired target

2. We update the scores for each database image

(Example, on the “natural” spectrum: given the feedback “I want something less natural than this,” images less natural than the reference get scores = scores + 1; the rest get scores = scores + 0.)
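The score update above can be sketched in a few lines; the function and variable names are illustrative, not from the paper:

```python
def whittle_update(scores, attr_scores, ref, direction):
    """One round of WhittleSearch-style feedback (sketch).

    scores:      current relevance tally per database image
    attr_scores: learned attribute strength per database image
                 (e.g. output of the "natural" ranking function)
    ref:         attribute strength of the reference image
    direction:   "more" or "less" -- the user's comparative feedback
    """
    for i, a in enumerate(attr_scores):
        satisfied = a > ref if direction == "more" else a < ref
        scores[i] += 1 if satisfied else 0  # scores = scores + 1 / + 0
    return scores

# "I want something less natural than this" (reference naturalness 0.5):
scores = whittle_update([0, 0, 0, 0], [0.9, 0.4, 0.2, 0.7], 0.5, "less")
```

Repeating the update over several feedback statements, as on the next slide, lets each image's score count how many user constraints it satisfies.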

Page 40

WhittleSearch with relative attribute feedback

(Example, on the “natural” and “perspective” spectra, with feedback “I want something more natural than this,” “I want something less natural than this,” and “I want something with more perspective than this”: each database image’s score tallies how many of the three constraints it satisfies, from 0 up to 3.)

Page 41

Datasets

• Shoes [Berg; Kovashka]: 14,658 shoe images; 10 attributes: “pointy”, “bright”, “high-heeled”, “feminine”, etc.

• OSR [Oliva & Torralba]: 2,688 scene images; 6 attributes: “natural”, “perspective”, “open-air”, “close-depth”, etc.

• PubFig [Kumar et al.]: 772 face images; 11 attributes: “masculine”, “young”, “smiling”, “round-face”, etc.

Page 42

Experimental setup

• Give the user the target image to look for

• Pair each target image with 16 reference images

• Get judgments on pairs from users on MTurk

Binary feedback baseline: “Is [image] similar to or dissimilar from [target]?”

Relative attribute feedback: “Is [image] more or less [attribute] than [target]?”, with attributes: pointy, open, bright, ornamented, shiny, high-heeled, long on the leg, formal, sporty, feminine

Page 43

WhittleSearch Results

We more rapidly converge on the envisioned visual content.

[Kovashka et al., CVPR 2012]

Page 44

WhittleSearch Results

We more rapidly converge on the envisioned visual content.

Richer feedback → faster gains per unit of user effort.

[Kovashka et al., CVPR 2012]

Page 45

Example WhittleSearch

Query: “I want a bright, open shoe that is short on the leg.”

Selected feedback over three rounds: “more open than [these]” (rounds 1 and 2) and “less ornaments than [these]” (round 3), reaching a match.

[Kovashka et al., CVPR 2012]

Page 46

Failure case (?)

Is the user searching for a specific person (identity), or a person meeting the description?

Page 47

Hybrid relevance feedback

• We integrate relative attribute and binary feedback by learning a relevance ranking function.

(Diagram: the image database is ordered from more relevant to less relevant using relevance constraints from binary feedback, “similar to these” / “dissimilar from these”, and relative attribute feedback, “more shiny than these”, along the learned “shininess” spectrum.)
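One way to see how both feedback types feed a common scorer is to express each statement as a per-image constraint and tally which images satisfy it. This is an illustrative sketch under assumed inputs, not the learned ranking function from the paper (which trains on these constraints rather than counting them); all names are mine:

```python
def satisfies(image_val, kind, ref_val):
    """Does one image satisfy one feedback statement?

    "more"/"less" compare a relative-attribute score against the
    reference image's score; "similar"/"dissimilar" compare a visual
    similarity value against a threshold (binary feedback).
    """
    if kind == "more":
        return image_val > ref_val
    if kind == "less":
        return image_val < ref_val
    if kind == "similar":
        return image_val >= ref_val  # image_val = similarity to the reference
    return image_val < ref_val       # "dissimilar"

def hybrid_scores(n_images, statements):
    """statements: list of (kind, per_image_values, reference_value).
    Returns how many mixed constraints each database image satisfies."""
    scores = [0] * n_images
    for kind, vals, ref in statements:
        for i in range(n_images):
            scores[i] += 1 if satisfies(vals[i], kind, ref) else 0
    return scores

# "more shiny than these" plus "similar to these", over 3 images:
statements = [("more", [0.9, 0.2, 0.6], 0.5),
              ("similar", [0.8, 0.1, 0.7], 0.5)]
ranked = hybrid_scores(3, statements)
```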

[Kovashka et al., CVPR 2012]

Page 48

Example hybrid WhittleSearch

Query: “I want a non-open shoe that is long on the leg and covered in ornaments.”

Selected feedback over two rounds mixes relative statements (“less open than [these]”, “more bright than [these]”) with binary ones (“dissimilar from [these]”, “similar to [these]”), reaching a match.

[Kovashka et al., CVPR 2012]

Page 49

Summary

• Fine-grained visual relevance is essential for next steps in image search

• Beyond tags when learning from text+images → model implied importance cues

• Beyond clicks as feedback → visual comparisons to refine search

Page 50

Looking forward

• What is implied by natural language description beyond ordering? (tags vs. sentences)

• How to ensure that the feedback a user gives is useful (e.g., not redundant)?

• What attributes should be in the vocabulary?

• How to align user’s attribute language with the visual attribute models?
