NIPS2009: Understand Visual Scenes - Part 1


Page 1: NIPS2009: Understand Visual Scenes - Part 1

Understanding Visual Scenes

Antonio Torralba

Computer Science and Artificial Intelligence Laboratory (CSAIL) Department of Electrical Engineering and Computer Science

Page 2: NIPS2009: Understand Visual Scenes - Part 1
Page 3: NIPS2009: Understand Visual Scenes - Part 1
Page 4: NIPS2009: Understand Visual Scenes - Part 1

[Image labels: DUCK, GRASS, PERSON, TREE, LAKE, BENCH, PATH, SKY, SIGN]

Page 5: NIPS2009: Understand Visual Scenes - Part 1

A VIEW OF A PARK ON A NICE SPRING DAY

Page 6: NIPS2009: Understand Visual Scenes - Part 1

PERSON FEEDING DUCKS IN THE PARK

PEOPLE WALKING IN THE PARK

DUCKS LOOKING FOR FOOD

Do not feed the ducks sign

Page 7: NIPS2009: Understand Visual Scenes - Part 1

DUCKS ON TOP OF THE GRASS

PEOPLE UNDER THE SHADOW OF THE TREES

Page 8: NIPS2009: Understand Visual Scenes - Part 1

Why do we care about recognition?

Perception of function: We can perceive the 3D shape, texture, and material properties without knowing about objects. But the concept of category also encapsulates information about what we can do with those objects.

“We therefore include the perception of function as a proper - indeed, crucial - subject for vision science”, from Vision Science, chapter 9, Palmer.

Page 9: NIPS2009: Understand Visual Scenes - Part 1

The perception of function

•  Direct perception (affordances): Gibson
   Flat surface, horizontal, knee-high, … → Sittable upon

•  Mediated perception (Categorization)
   Flat surface, horizontal, knee-high, … → Chair → Sittable upon

[Image labels: Chair, Chair, Chair?]

Page 10: NIPS2009: Understand Visual Scenes - Part 1

Direct perception

Some aspects of an object's function can be perceived directly.

•  Functional form: Some forms clearly indicate a function (“sittable-upon”, container, cutting device, …)

[Image labels: Sittable-upon, Sittable-upon, Sittable-upon]

It does not seem easy to sit upon this…

Page 11: NIPS2009: Understand Visual Scenes - Part 1

Scenes, as objects, also have affordances

Page 12: NIPS2009: Understand Visual Scenes - Part 1

The function of the scene

Page 13: NIPS2009: Understand Visual Scenes - Part 1
Page 14: NIPS2009: Understand Visual Scenes - Part 1

Direct perception

Some aspects of an object's function can be perceived directly.

•  Observer relativity: Function is observer dependent

From http://lastchancerescueflint.org

Page 15: NIPS2009: Understand Visual Scenes - Part 1

Limitations of Direct Perception

Objects of similar structure might have very different functions.

Not all functions seem to be available from direct visual information only.

Page 16: NIPS2009: Understand Visual Scenes - Part 1

Limitations of Direct Perception

Page 17: NIPS2009: Understand Visual Scenes - Part 1

Object detection and recognition Short overview of current approaches

Page 18: NIPS2009: Understand Visual Scenes - Part 1

Object recognition Is it really so hard?

This is a chair

Find the chair in this image Output of normalized correlation

Page 19: NIPS2009: Understand Visual Scenes - Part 1

Object recognition Is it really so hard?

Find the chair in this image

Pretty much garbage.

Simple template matching is not going to make it.
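The "normalized correlation" baseline mentioned on these slides can be sketched as follows. This is a minimal illustration, assuming grayscale images stored as NumPy arrays; the function name `normalized_correlation` and the toy data are hypothetical, not taken from the tutorial.

```python
import numpy as np

def normalized_correlation(image, template):
    """Slide the template over the image and return a map of scores in [-1, 1]."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            # A flat patch or flat template has no defined correlation; score it 0.
            scores[y, x] = (t * p).sum() / denom if denom > 0 else 0.0
    return scores

# Toy usage: the template is cut from the scene itself, so the match is exact.
rng = np.random.default_rng(0)
scene = rng.random((20, 20))
template = scene[5:10, 7:12].copy()
scores = normalized_correlation(scene, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

The toy case works only because the template is an exact crop. A chair seen from another viewpoint, scale, or lighting no longer correlates with a single fixed template, which is the failure the slide points at.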

Page 20: NIPS2009: Understand Visual Scenes - Part 1

So, let’s make the problem simpler: Block world

Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006

Page 21: NIPS2009: Understand Visual Scenes - Part 1

Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006

Binford and generalized cylinders

Recognition by components

Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.

Introduced in computer vision by A. Pentland, 1986.

Page 22: NIPS2009: Understand Visual Scenes - Part 1

Families of recognition algorithms

•  Bag of words models: Csurka, Dance, Fan, Willamowski, and Bray 2004; Sivic, Russell, Freeman, Zisserman, ICCV 2005
•  Voting models: Viola and Jones, ICCV 2001; Heisele, Poggio, et al., NIPS 01; Schneiderman, Kanade 2004; Vidal-Naquet, Ullman 2003
•  Constellation models: Fischler and Elschlager, 1973; Burl, Leung, and Perona, 1995; Weber, Welling, and Perona, 2000; Fergus, Perona, & Zisserman, CVPR 2003
•  Rigid template models: Sirovich and Kirby 1987; Turk, Pentland, 1991; Dalal & Triggs, 2006
•  Shape matching, deformable models: Berg, Berg, Malik, 2005; Cootes, Edwards, Taylor, 2001

Page 23: NIPS2009: Understand Visual Scenes - Part 1

Feret dataset, 1996 DARPA

The face age

•  The representation and matching of pictorial structures - Fischler, Elschlager (1973)
•  Face recognition using eigenfaces - M. Turk and A. Pentland (1991)
•  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
•  Graded Learning for Object Detection - Fleuret, Geman (1999)
•  Robust Real-time Object Detection - Viola, Jones (2001)
•  Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
•  …

Page 24: NIPS2009: Understand Visual Scenes - Part 1

Face detection

Page 25: NIPS2009: Understand Visual Scenes - Part 1

Haar-like filters and cascades Viola and Jones, ICCV 2001

The average intensity in the block is computed with four sums independently of the block size.

Also Fleuret and Geman, 2001
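The constant-time block average can be illustrated with an integral image (summed-area table). This is a minimal sketch, assuming a grayscale NumPy array; the helper names are hypothetical, and the two-rectangle feature at the end is only one example of a Haar-like filter.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y, :x]; the extra zero row/column avoids edge cases."""
    img = np.asarray(img, dtype=float)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def block_sum(ii, top, left, height, width):
    """Sum over a rectangle from four table lookups, whatever the block size."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def block_mean(ii, top, left, height, width):
    return block_sum(ii, top, left, height, width) / (height * width)

def two_rect_feature(ii, top, left, height, width):
    """Haar-like feature: mean of the right half minus mean of the left half."""
    half = width // 2
    return (block_mean(ii, top, left + half, height, half)
            - block_mean(ii, top, left, height, half))

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
total = block_sum(ii, 1, 1, 2, 2)       # same as img[1:3, 1:3].sum()
feature = two_rect_feature(ii, 0, 0, 4, 4)
```

Because every rectangle costs the same four lookups, a cascade can evaluate many such features at many scales without rescanning pixels.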

Page 26: NIPS2009: Understand Visual Scenes - Part 1

Generic objects: Edge based descriptors

Gavrila, Philomin, ICCV 1999 Papageorgiou & Poggio (2000)

J. Shotton, A. Blake, R. Cipolla. PAMI 2008.

Opelt, Pinz, Zisserman, ECCV 2006

Page 27: NIPS2009: Understand Visual Scenes - Part 1

Histograms of oriented gradients

Shape context: Belongie, Malik, Puzicha, NIPS 2000

SIFT: D. Lowe, ICCV 1999

Page 28: NIPS2009: Understand Visual Scenes - Part 1

Histograms of oriented gradients

Dalal & Triggs, 2006

x Not a person

x person
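The core of a histogram of oriented gradients can be sketched for a single cell as below. The 9 unsigned orientation bins and the L2 normalization are assumptions made for illustration (the slide does not state parameters), and the sketch leaves out the overlapping block normalization and the linear classifier that separates "person" from "not a person".

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Gradient-magnitude-weighted histogram of unsigned orientations in one cell."""
    cell = np.asarray(cell, dtype=float)
    gy, gx = np.gradient(cell)                      # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A ramp along x has purely horizontal gradients; a ramp along y has vertical ones.
ramp_x = np.tile(np.arange(8.0), (8, 1))
hist_x = cell_orientation_histogram(ramp_x)
hist_y = cell_orientation_histogram(ramp_x.T)
```

A full descriptor would concatenate such histograms over a grid of cells covering the detection window.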

Page 29: NIPS2009: Understand Visual Scenes - Part 1

Adding parts Felzenszwalb, McAllester, Ramanan. 2008.

Page 30: NIPS2009: Understand Visual Scenes - Part 1

Felzenszwalb, McAllester, Ramanan. 2008.

Adding parts

Page 31: NIPS2009: Understand Visual Scenes - Part 1

Felzenszwalb, McAllester, Ramanan. 2008.

Page 32: NIPS2009: Understand Visual Scenes - Part 1

Evaluation of performance

The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object it is trying to detect?

Before plotting any ROC or precision-recall curves…

Page 33: NIPS2009: Understand Visual Scenes - Part 1

What object is the detector trying to detect?

The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object it is trying to detect?

Page 34: NIPS2009: Understand Visual Scenes - Part 1

What object is the detector trying to detect?

The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object it is trying to detect?

Table detector

Page 35: NIPS2009: Understand Visual Scenes - Part 1

[Figure: detector outputs numbered 1 to 7]

1. chair, 2. table, 3. road, 4. road, 5. table, 6. car, 7. keyboard.

Page 36: NIPS2009: Understand Visual Scenes - Part 1

Some symptoms of standard approaches

Page 37: NIPS2009: Understand Visual Scenes - Part 1
Page 38: NIPS2009: Understand Visual Scenes - Part 1
Page 39: NIPS2009: Understand Visual Scenes - Part 1
Page 40: NIPS2009: Understand Visual Scenes - Part 1
Page 41: NIPS2009: Understand Visual Scenes - Part 1

Scenes rule over objects

The 3D percept is driven by the scene, which imposes its rules on the objects.

Page 42: NIPS2009: Understand Visual Scenes - Part 1

Scene recognition The gist of the scene

Page 43: NIPS2009: Understand Visual Scenes - Part 1

Mary Potter (1976)

Mary Potter (1975, 1976) demonstrated that during rapid serial visual presentation (100 msec per image), a novel picture is understood almost instantly, and observers seem to comprehend a lot of visual information.

Page 44: NIPS2009: Understand Visual Scenes - Part 1

Demo : Rapid image understanding

Instructions: 9 photographs will be shown for half a second each. Your task is to memorize these pictures

By Aude Oliva

Page 45: NIPS2009: Understand Visual Scenes - Part 1
Page 46: NIPS2009: Understand Visual Scenes - Part 1
Page 47: NIPS2009: Understand Visual Scenes - Part 1
Page 48: NIPS2009: Understand Visual Scenes - Part 1
Page 49: NIPS2009: Understand Visual Scenes - Part 1
Page 50: NIPS2009: Understand Visual Scenes - Part 1
Page 51: NIPS2009: Understand Visual Scenes - Part 1
Page 52: NIPS2009: Understand Visual Scenes - Part 1
Page 53: NIPS2009: Understand Visual Scenes - Part 1
Page 54: NIPS2009: Understand Visual Scenes - Part 1
Page 55: NIPS2009: Understand Visual Scenes - Part 1

Which of the following pictures have you seen ?

If you have seen the image clap your hands once

If you have not seen the image do nothing

Memory Test

Page 56: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 57: NIPS2009: Understand Visual Scenes - Part 1

NO

Page 58: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 59: NIPS2009: Understand Visual Scenes - Part 1

NO

Page 60: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 61: NIPS2009: Understand Visual Scenes - Part 1

NO

Page 62: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 63: NIPS2009: Understand Visual Scenes - Part 1

NO

Page 64: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 65: NIPS2009: Understand Visual Scenes - Part 1

Yes

Page 66: NIPS2009: Understand Visual Scenes - Part 1

Have you seen this picture ?

Page 67: NIPS2009: Understand Visual Scenes - Part 1

NO

Page 68: NIPS2009: Understand Visual Scenes - Part 1

You have seen these pictures

You were tested with these pictures

Page 69: NIPS2009: Understand Visual Scenes - Part 1

The gist of the scene

At a glance, we remember the meaning of an image and its global layout, but some objects and details are forgotten.

Page 70: NIPS2009: Understand Visual Scenes - Part 1

From objects to scenes

[Diagram: image I; local features L; objects O1, O2 (object localization); scene S]

S: SceneType ∈ {street, office, …}

Riesenhuber & Poggio (99); Vidal-Naquet & Ullman (03); Serre & Poggio (05); Agarwal & Roth (02); Moghaddam, Pentland (97); Turk, Pentland (91); Heisele, et al. (01); Kremp, Geman, Amit (02); Dorko, Schmid (03); Fergus, Perona, Zisserman (03); Fei Fei, Fergus, Perona (03); Schneiderman, Kanade (00); Lowe (99)

Page 71: NIPS2009: Understand Visual Scenes - Part 1

What makes scenes different?

Different objects, different spatial layout

[Figure labels: floor, door, light, wall, ceiling, painting, fireplace, armchair, coffee table, lamp, mirror, bed, side-table, phone, alarm, carpet]

Page 72: NIPS2009: Understand Visual Scenes - Part 1

What makes scenes different?

Different objects, similar spatial layout

[Figure labels: window, sink, dishes, faucet, cabinet, counter, mirror, shelves, ceiling, wall, towel, soap, beer, glasses, stool, lamp]

Page 73: NIPS2009: Understand Visual Scenes - Part 1

What makes scenes different?

Similar objects, different spatial layout

[Figure labels: table, chair, ceiling, wall, window, bottle, floor]

Page 74: NIPS2009: Understand Visual Scenes - Part 1

What makes scenes different?

Similar objects, different spatial layout

[Figure labels: table, chair, ceiling, wall, window, bottle, floor]

Page 75: NIPS2009: Understand Visual Scenes - Part 1

What makes scenes different?

Similar objects, and similar spatial layout

[Figure labels: seat, window, ceiling, cabinets, screen, wall, column]

Different lighting, different materials, very specific object categories

Page 76: NIPS2009: Understand Visual Scenes - Part 1

What can be an alternative to objects?

Page 77: NIPS2009: Understand Visual Scenes - Part 1

Scene emergent features

“Recognition via features that are not those of individual objects but ‘emerge’ as objects are brought into relation to each other to form a scene.” (Biederman 81)

From “On the semantics of a glance at a scene”, Biederman, 1981

Page 78: NIPS2009: Understand Visual Scenes - Part 1

Examples of scene emergent features

Suggestive edges and junctions Simple geometric forms

Blobs Textures

Page 79: NIPS2009: Understand Visual Scenes - Part 1

Ensemble statistics

Ariely, 2001, Seeing sets: Representation by statistical properties

Chong, Treisman, 2003, Representation of statistical properties

Alvarez, Oliva, 2008, 2009, Spatial ensemble statistics

Conclusion: observers had more accurate representation of the mean than of the individual members of the set.

Page 80: NIPS2009: Understand Visual Scenes - Part 1

From scenes to objects

[Diagram: image I; local features L; global features G (scene emergent features, ensemble statistics, global features); scene S; objects O1, O2 (object localization)]

S: SceneType ∈ {street, office, …}

Page 81: NIPS2009: Understand Visual Scenes - Part 1

How far can we go without objects?

[Diagram: image I; local features L; global features G (scene emergent features, ensemble statistics, global features); scene S]

S: SceneType ∈ {street, office, …}

Page 82: NIPS2009: Understand Visual Scenes - Part 1

Global image descriptors

Page 83: NIPS2009: Understand Visual Scenes - Part 1

Global image descriptors

•  Bag of words: Sivic et al., ICCV 2005; Fei-Fei and Perona, CVPR 2005
•  Non localized textons: Walker, Malik. Vision Research 2004; …
•  Spatially organized textures: M. Gorkani, R. Picard, ICPR 1994; A. Oliva, A. Torralba, IJCV 2001; S. Lazebnik, et al, CVPR 2006; …

R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Computing Surveys, vol. 40, no. 2, pp. 5:1-60, 2008.

Page 84: NIPS2009: Understand Visual Scenes - Part 1

Gist descriptor

8 orientations × 4 scales × 16 bins = 512 dimensions

•  Apply oriented Gabor filters over different scales
•  Average filter energy in each bin

Similar to SIFT (Lowe 1999) applied to the entire image

M. Gorkani, R. Picard, ICPR 1994; Walker, Malik. Vision Research 2004; Vogel et al. 2004; Fei-Fei and Perona, CVPR 2005; S. Lazebnik, et al, CVPR 2006; …

Oliva and Torralba, 2001
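The two steps of the gist descriptor (oriented filters at several scales, then average energy per spatial bin) can be sketched as below. This is an illustration under stated assumptions, not the published implementation: the filter shapes, centre frequencies and bandwidths are made up for the example, the 16 bins are taken to be a 4 × 4 spatial grid, and the input is assumed to be a square grayscale array whose side is divisible by 4.

```python
import numpy as np

def gabor_bank(size, n_scales=4, n_orient=8):
    """Gabor-like transfer functions defined directly in the frequency domain."""
    fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
    radius = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)
    filters = []
    for s in range(n_scales):
        f0 = 0.3 / (2 ** s)                       # assumed centre frequency per scale
        for o in range(n_orient):
            t0 = np.pi * o / n_orient
            dtheta = np.angle(np.exp(1j * (theta - t0)))   # wrapped to [-pi, pi]
            radial = np.exp(-((radius - f0) ** 2) / (2 * (0.35 * f0) ** 2))
            angular = np.exp(-(dtheta ** 2) / (2 * (np.pi / n_orient) ** 2))
            filters.append(radial * angular)
    return np.stack(filters)

def gist(image, n_scales=4, n_orient=8, grid=4):
    """Average filter energy in each cell of a grid x grid layout, for every filter."""
    image = np.asarray(image, dtype=float)
    size = image.shape[0]
    assert image.shape == (size, size) and size % grid == 0
    spectrum = np.fft.fft2(image)
    cell = size // grid
    features = []
    for transfer in gabor_bank(size, n_scales, n_orient):
        energy = np.abs(np.fft.ifft2(spectrum * transfer))
        pooled = energy.reshape(grid, cell, grid, cell).mean(axis=(1, 3))
        features.append(pooled.ravel())
    return np.concatenate(features)

rng = np.random.default_rng(0)
descriptor = gist(rng.random((64, 64)))   # 8 * 4 * 16 = 512 values
```

The resulting vector can be fed to any standard classifier; nothing in it depends on segmenting or detecting objects first.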

Page 85: NIPS2009: Understand Visual Scenes - Part 1

Textons Filter bank K-means (100 clusters)

Walker, Malik, 2004

Malik, Belongie, Shi, Leung, 1999
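The texton pipeline (filter bank, then k-means on the per-pixel responses) can be sketched as follows. The tiny derivative filter bank here is a stand-in chosen for brevity, not the filter bank used in the cited work, and the example uses 10 clusters instead of the 100 on the slide to keep it small.

```python
import numpy as np

def filter_responses(image):
    """Stand-in filter bank: x and y differences at two spacings (wraps at borders)."""
    image = np.asarray(image, dtype=float)
    responses = []
    for step in (1, 2):
        responses.append(np.roll(image, -step, axis=1) - np.roll(image, step, axis=1))
        responses.append(np.roll(image, -step, axis=0) - np.roll(image, step, axis=0))
    # One row per pixel, one column per filter.
    return np.stack(responses, axis=-1).reshape(-1, len(responses))

def nearest_center(points, centers):
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; an emptied cluster keeps its previous centre."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
image = rng.random((32, 32))
responses = filter_responses(image)
textons = kmeans(responses, k=10)                       # cluster centres = textons
texton_map = nearest_center(responses, textons).reshape(image.shape)
histogram = np.bincount(texton_map.ravel(), minlength=len(textons)) / texton_map.size
```

The normalized histogram of texton labels is the "non localized" global descriptor: it records which textures occur, not where.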

Page 86: NIPS2009: Understand Visual Scenes - Part 1

Bag of words & spatial pyramid matching

S. Lazebnik, et al, CVPR 2006

Sivic, Zisserman, 2003. Visual words = Kmeans of SIFT descriptors
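Once every location has a visual-word index, a spatial pyramid can be sketched as histograms over progressively finer grids. This is a minimal sketch assuming a dense map of word indices (`word_map`, a hypothetical input); the level weights follow the pyramid-match weighting as commonly described for Lazebnik et al. and should be treated as an assumption here.

```python
import numpy as np

def spatial_pyramid(word_map, n_words, levels=2):
    """Concatenate weighted word histograms over 1x1, 2x2, ..., 2^L x 2^L grids."""
    word_map = np.asarray(word_map)
    h, w = word_map.shape
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Coarser levels get smaller weights (assumed weighting scheme).
        weight = 1.0 / 2 ** levels if level == 0 else 1.0 / 2 ** (levels - level + 1)
        ys = np.linspace(0, h, cells + 1).astype(int)
        xs = np.linspace(0, w, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                block = word_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                parts.append(weight * np.bincount(block.ravel(), minlength=n_words))
    return np.concatenate(parts)

def intersection_kernel(a, b):
    """Histogram intersection, usable as a kernel between two pyramid vectors."""
    return np.minimum(a, b).sum()

rng = np.random.default_rng(0)
word_map = rng.integers(0, 5, size=(16, 16))    # 5 visual words on a 16x16 grid
pyramid = spatial_pyramid(word_map, n_words=5)  # 5 * (1 + 4 + 16) = 105 values
```

Level 0 alone is the plain bag of words; the finer levels add back a coarse record of where each word occurs.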

Page 87: NIPS2009: Understand Visual Scenes - Part 1

The 15-scenes benchmark

Bedroom Suburb

Industrial Kitchen

Living room Coast Forest

Highway

Building facade

Mountain Open country Street

Skyscrapers

Office

Store

Oliva & Torralba, 2001 Fei Fei & Perona, 2005 Lazebnik, et al 2006

Page 88: NIPS2009: Understand Visual Scenes - Part 1

Scene recognition 100 training samples per class

SVM classifier in both cases Human performance

Page 89: NIPS2009: Understand Visual Scenes - Part 1

Large Scale Scene Recognition

Nature, Urban, Indoor

~1,000 categories 

>130,000 images 

>12,000 fully annotated images 

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 90: NIPS2009: Understand Visual Scenes - Part 1
Page 91: NIPS2009: Understand Visual Scenes - Part 1

Performance with 400 categories

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 92: NIPS2009: Understand Visual Scenes - Part 1

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Abbey

Airplane cabin

Airport terminal

Alley

Amphitheater

Training images

Page 93: NIPS2009: Understand Visual Scenes - Part 1

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Abbey

Airplane cabin

Airport terminal

Alley

Amphitheater

Training images Correct classifications

Page 94: NIPS2009: Understand Visual Scenes - Part 1

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Abbey

Airplane cabin

Airport terminal

Alley

Amphitheater

Monastery Cathedral Castle

Toy shop Van Discotheque

Subway Stage Restaurant

Restaurant patio

Courtyard Canal

Harbor Coast Athletic field

Training images, Correct classifications, Misclassifications

Page 95: NIPS2009: Understand Visual Scenes - Part 1

Example of three different scenes: RIVER, BEACH, VILLAGE

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 96: NIPS2009: Understand Visual Scenes - Part 1

But they are all part of the same picture

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 97: NIPS2009: Understand Visual Scenes - Part 1

But they are all part of the same picture

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 98: NIPS2009: Understand Visual Scenes - Part 1

Scene detection

Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010

Page 99: NIPS2009: Understand Visual Scenes - Part 1

Categories or a continuous space?

Check poster by Malisiewicz, Efros 

Page 100: NIPS2009: Understand Visual Scenes - Part 1

Categories or a continuous space? From the city to the mountains in 10 steps

Page 101: NIPS2009: Understand Visual Scenes - Part 1

Spatial envelope: a continuous space of scenes

Axes: Degree of Openness; Degree of Expansion (vertical)

Highway Street City centre Tall Building

Oliva & Torralba, 2001

Page 102: NIPS2009: Understand Visual Scenes - Part 1

Coast Countryside Forest Mountain

Axes: Degree of Openness; Degree of Ruggedness (vertical)

Spatial envelope: a continuous space of scenes

Oliva & Torralba, 2001

Page 103: NIPS2009: Understand Visual Scenes - Part 1

Context for object recognition

Page 104: NIPS2009: Understand Visual Scenes - Part 1

Who needs context anyway? We can recognize objects even out of context 

Banksy

Page 105: NIPS2009: Understand Visual Scenes - Part 1

Look‐Alikes by Joan Steiner 

Even in high resolution, we cannot shut down contextual processing, and it is hard to recognize the true identities of the elements that compose this scene.

Page 106: NIPS2009: Understand Visual Scenes - Part 1

Why is context important? •  Changes the interpretation of an object (or its function)

•  Context defines what an unexpected event is

Page 107: NIPS2009: Understand Visual Scenes - Part 1

Objects and Scenes 

Biederman’s violations (1981):

Page 108: NIPS2009: Understand Visual Scenes - Part 1

Global precedence

Forest Before Trees: The Precedence of Global Features in Visual Perception. Navon (1977)

Page 109: NIPS2009: Understand Visual Scenes - Part 1

Scene recognition without object recognition

[Graphical model: scene S; scene features g]

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.

Page 110: NIPS2009: Understand Visual Scenes - Part 1

An integrated model of Scenes, Objects, and Parts

[Graphical model: scene S; scene gist features g; Ncar]

[Plots: P(Ncar | S = street) and P(Ncar | S = park) as functions of N, with axis ticks at 0, 1, 5]

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
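How a scene belief turns into an expectation about object counts can be sketched by marginalizing over the scene: P(Ncar | g) = Σ_S P(Ncar | S) P(S | g). All probabilities below are made-up illustrative numbers, not values from the cited papers, and the function name is hypothetical.

```python
def count_distribution(p_scene, p_count_given_scene):
    """Marginalize the scene out: P(N | g) = sum over S of P(N | S) * P(S | g)."""
    counts = sorted({n for dist in p_count_given_scene.values() for n in dist})
    return {
        n: sum(p_scene[s] * p_count_given_scene[s].get(n, 0.0) for s in p_scene)
        for n in counts
    }

# Hypothetical scene posterior from the gist features, P(S | g).
p_scene_given_gist = {"street": 0.7, "park": 0.3}

# Hypothetical count models P(Ncar | S): streets tend to contain cars, parks do not.
p_ncar_given_scene = {
    "street": {0: 0.2, 1: 0.3, 5: 0.5},
    "park": {0: 0.8, 1: 0.15, 5: 0.05},
}

p_ncar = count_distribution(p_scene_given_gist, p_ncar_given_scene)
```

With these numbers, a gist that favours "street" shifts probability mass away from zero cars before any detector has run.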

Page 111: NIPS2009: Understand Visual Scenes - Part 1

Object retrieval: scene features vs. detector

Results using the keyboard detector alone

Results using both the detector and the global scene features

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.

Page 112: NIPS2009: Understand Visual Scenes - Part 1

The layered structure of scenes 

p(x2|x1) p(x)

In a display with multiple targets present, the location of one target constrains the ‘y’ coordinate of the remaining targets, but not the ‘x’ coordinate.

Assuming a human observer standing on the ground

Torralba, Oliva, Castelhano, Henderson. 2006

Page 113: NIPS2009: Understand Visual Scenes - Part 1

Context driven object detection

[Graphical model: scene S; scene gist features g; Zcar; Ncar]

[Plot: P(Ncar | S = street) as a function of N, with axis ticks at 0, 1, 5]

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.

Page 114: NIPS2009: Understand Visual Scenes - Part 1

An integrated model of Scenes, Objects, and Parts

We train a multiview car detector.

p(d | F=1) = N(d | µ1, σ1)

p(d | F=0) = N(d | µ0, σ0)

[Graphical model: for each i (N=4), nodes xcar_i, dcar_i, Fcar_i]

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
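The two Gaussian score models combine with a prior through Bayes' rule to give the probability that a detection is a true object. This is a minimal sketch: σ is treated as a standard deviation, the prior P(F=1) is an assumed input (in the integrated model it would come from the scene), and the parameter values are hypothetical.

```python
import math

def gaussian_pdf(d, mu, sigma):
    """Density of N(d | mu, sigma), with sigma taken as a standard deviation."""
    return math.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_object(d, mu1, sigma1, mu0, sigma0, prior=0.5):
    """P(F=1 | d) from p(d | F=1), p(d | F=0) and the prior P(F=1)."""
    like1 = gaussian_pdf(d, mu1, sigma1) * prior
    like0 = gaussian_pdf(d, mu0, sigma0) * (1.0 - prior)
    return like1 / (like1 + like0)

# Hypothetical score models: true cars score around +1, background around -1.
ambiguous = posterior_object(0.0, mu1=1.0, sigma1=1.0, mu0=-1.0, sigma0=1.0)
with_context = posterior_object(0.0, 1.0, 1.0, -1.0, 1.0, prior=0.9)
without_context = posterior_object(0.0, 1.0, 1.0, -1.0, 1.0, prior=0.1)
```

The same ambiguous detector score is accepted or rejected depending on the prior, which is how scene context can change the interpretation of a weak detection. For scores far from both means the densities can underflow to zero, so a production version would work with log-likelihoods.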

Page 115: NIPS2009: Understand Visual Scenes - Part 1

An integrated model of Scenes, Objects, and Parts

[Graphical model: scene S; scene gist features g; Zcar; Ncar; and, for each i (M=4), nodes xcar_i, dcar_i, Fcar_i]

Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.

Page 116: NIPS2009: Understand Visual Scenes - Part 1

Two tasks 

Page 117: NIPS2009: Understand Visual Scenes - Part 1