Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions Bangpeng Yao and Li Fei-Fei Computer Science Department, Stanford

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Bangpeng Yao and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu


Human-Object Interaction

[Figure labels: Playing saxophone; Human; Saxophone; Not playing saxophone]

Robots interact with objects

Automatic sports commentary

“Kobe is dunking the ball.”

Medical care


Human-Object Interaction

Background: Human-Object Interaction

• Schneiderman & Kanade, 2000
• Viola & Jones, 2001
• Huang et al, 2007
• Papageorgiou & Poggio, 2000
• Wu & Nevatia, 2005
• Dalal & Triggs, 2005
• Mikolajczyk et al, 2005
• Leibe et al, 2005
• Bourdev & Malik, 2009
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

• Lowe, 1999
• Belongie et al, 2002
• Fergus et al, 2003
• Fei-Fei et al, 2004
• Berg & Malik, 2005
• Felzenszwalb et al, 2005
• Grauman & Darrell, 2005
• Sivic et al, 2005
• Lazebnik et al, 2006
• Zhang et al, 2006
• Savarese et al, 2007
• Lampert et al, 2008
• Desai et al, 2009
• Gehler & Nowozin, 2009

• Murphy et al, 2003
• Hoiem et al, 2006
• Shotton et al, 2006

• Rabinovich et al, 2007
• Heitz & Koller, 2008
• Divvala et al, 2009

• Gupta et al, 2009


context

vs.

To be done

• Yao & Fei-Fei, 2010a
• Yao & Fei-Fei, 2010b


Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion


Recognizing Human-Object Interaction is Challenging

Different background

Same object (saxophone), different interactions

Different pose (or viewpoint)

Different lighting

Different instrument, similar pose

Reference image: playing saxophone


Grouplet: our intuition

Bag-of-words
• Thomas & Malik, 2001
• Csurka et al, 2004
• Fei-Fei & Perona, 2005
• Sivic et al, 2005

Spatial pyramid
• Grauman & Darrell, 2005
• Lazebnik et al, 2006

Part-based
• Weber et al, 2000
• Fergus et al, 2003
• Leibe et al, 2004
• Felzenszwalb et al, 2005
• Bourdev & Malik, 2009

Grouplet Representation:

[Figure: histogram; x-axis ticks 0 to 200, y-axis ticks 0 to 25]

Grouplet: our intuition

Grouplet Representation:

• Part-based configuration

• Co-occurrence

• Discriminative

• Dense

Capture the subtle difference in human-object interactions.






Grouplet representation (e.g. 2-Grouplet)

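Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.

$\Lambda = \{\lambda_1, \lambda_2\}$, with $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$

[Figure: image I with reference point P and the two feature units λ1 and λ2; labels: Visual codewords, Gaussian distribution]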

Notations

• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.

Matching score between Λ and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

Matching score between λi and I:

$$\nu(\lambda_i, I) = \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i)$$

Here $p(A_i \mid a')$ is the codeword assignment score and $N(x' \mid x_i, \sigma_i)$ is the Gaussian density value.

With a small shift Δ of the location:

$$\nu(\Lambda, I) = \min_i \max_{\Delta_j} \sum_{x' \in \Omega(x_i + \Delta_j)} p(A_i \mid a') \cdot N(x' \mid x_i + \Delta_j, \sigma_i)$$

• Δ: A small shift of the location.

[Figure: image I with reference point P and the 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$; matching score: 0.6]
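The matching scores ν(λi, I) and ν(Λ, I) defined in the notation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each patch is a hypothetical pair `(probs, loc)` where `probs` maps codewords to assignment scores p(A | a′); the neighborhood Ω(x) is assumed to be a disc of radius `radius`; the Gaussian is assumed isotropic with σi used as its variance; locations are assumed to be expressed relative to the reference point P; and the shift Δ is left out.

```python
import math


def gaussian_density(loc, mean, var):
    """Isotropic 2-D Gaussian density N(loc | mean, var); `var` is the variance."""
    sq_dist = (loc[0] - mean[0]) ** 2 + (loc[1] - mean[1]) ** 2
    return math.exp(-sq_dist / (2.0 * var)) / (2.0 * math.pi * var)


def unit_score(unit, patches, radius):
    """nu(lambda_i, I): sum of p(A_i | a') * N(x' | x_i, sigma_i) over Omega(x_i)."""
    codeword, loc, var = unit
    total = 0.0
    for probs, patch_loc in patches:
        # Assumed neighborhood test: the patch lies within `radius` of x_i.
        if math.dist(patch_loc, loc) <= radius:
            total += probs.get(codeword, 0.0) * gaussian_density(patch_loc, loc, var)
    return total


def grouplet_score(grouplet, patches, radius):
    """nu(Lambda, I): the weakest feature unit determines the score."""
    return min(unit_score(unit, patches, radius) for unit in grouplet)


# Toy image with two patches and made-up soft codeword assignments.
patches = [
    ({"A1": 0.9, "A2": 0.1}, (10.0, 10.0)),
    ({"A2": 0.8}, (30.0, 12.0)),
]
unit1 = ("A1", (10.0, 10.0), 4.0)   # (codeword, location, variance)
unit2 = ("A2", (30.0, 10.0), 4.0)
score = grouplet_score([unit1, unit2], patches, radius=5.0)
```

Because the grouplet score is a minimum, a single feature unit that finds no supporting patch drives the whole score to zero, which is what makes the representation sensitive to co-occurrence.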


Grouplet representation

• Part-based configuration

• Co-occurrence

• Discriminative

matching score: 0.4 matching score: 0.0 matching score: 0.1

Playing saxophone Other interactions

[Figure: image I with reference point P and the 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_i : \{A_i, x_i, \sigma_i\}$]

• Part-based configuration

• Co-occurrence

• Discriminative

• Dense

Grouplet representation

All possible Codewords

Densely sample image locations

Many possible spatial distributions

1-grouplet 2-grouplet 3-grouplet

All possible combinations of feature units

[Figure: image I with reference point P and the 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_i : \{A_i, x_i, \sigma_i\}$]
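The "dense" property can be illustrated by enumerating feature units as the Cartesian product of codewords, sampled locations, and spatial variances, and then forming grouplets as combinations of feature units. The sizes below are hypothetical and chosen only to keep the example small; the slides do not state the actual codebook size, sampling grid, or set of variances.

```python
from itertools import combinations, product

# Hypothetical, deliberately tiny choices for illustration only.
codewords = ["A1", "A2", "A3"]
locations = [(x, y) for x in range(0, 40, 20) for y in range(0, 40, 20)]  # 2 x 2 grid
variances = [4.0, 16.0]

# Every (codeword, location, variance) triple is one feature unit.
feature_units = list(product(codewords, locations, variances))
n_units = len(feature_units)

# A 2-grouplet is any unordered pair of distinct feature units.
two_grouplets = list(combinations(feature_units, 2))
```

Even with these toy sizes the number of 2-grouplets is already an order of magnitude larger than the number of feature units, and it grows further with grouplet size.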






A “Space” of Grouplets


Playing violin

Other interactions

Playing saxophone

Other interactions

On background

Shared by different interactions


Shared by different interactions

On background


We only need discriminative Grouplets

Large ν(Λ,I) Small ν(Λ,I) Large ν(Λ,I) Small ν(Λ,I)

Playing violin

Other interactions

Playing saxophone

Other interactions

Number of Grouplets: $2^N$ (very large space)

Number of feature units: N. N is large (192200)
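To get a sense of the scale, assume every subset of the N feature units counts as a candidate grouplet. A rough estimate with N = 192200 then gives:

$$2^{N} = 10^{N \log_{10} 2} \approx 10^{192200 \times 0.30103} \approx 10^{57858}$$

Exhaustively scoring a space of this size is not feasible, which is why only the discriminative grouplets are sought.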


Obtaining discriminative grouplets for a class

Obtain grouplets with large ν(Λ,I) on the class.

Remove grouplets with large ν(Λ,I) from other classes.

Apriori Mining

[Agrawal & Srikant, 1994]

Selected 1-grouplets

Candidate 2-grouplets

Number of Grouplets: $2^N$ (very large space)

Number of feature units: N. N is large (192200)

Mine 1000~2000 grouplets; only (2~100)×N grouplets need to be evaluated
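The reason Apriori-style mining applies here is that ν(Λ,I) is a minimum over feature units, so adding a unit to a grouplet can never raise its score: a grouplet can only score well on a class if all of its sub-grouplets do. The sketch below illustrates that level-wise search. It is an assumption-laden illustration rather than the authors' procedure: "large ν(Λ,I)" is approximated by the mean matching score over images, the thresholds `min_support` and `max_other` are hypothetical parameters, and unit scores ν(λi, I) are assumed to be precomputed per image.

```python
from itertools import combinations


def grouplet_score(grouplet, unit_scores):
    """nu(Lambda, I) from the precomputed unit scores of one image."""
    return min(unit_scores[u] for u in grouplet)


def mean_score(grouplet, images):
    """Average matching score of a grouplet over a list of images."""
    return sum(grouplet_score(grouplet, img) for img in images) / len(images)


def mine_grouplets(class_images, other_images, min_support, max_other, max_size=3):
    """Level-wise search: grow grouplets only from well-supported smaller ones."""
    units = sorted(class_images[0])
    frequent = [frozenset([u]) for u in units
                if mean_score([u], class_images) >= min_support]
    selected = []
    size = 1
    while frequent and size <= max_size:
        # Keep grouplets that score low on the other classes (discriminative).
        selected.extend(g for g in frequent
                        if mean_score(g, other_images) <= max_other)
        # Candidates of the next size: unions whose every subset is well supported.
        known = set(frequent)
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == size + 1}
        frequent = [c for c in candidates
                    if all(frozenset(s) in known for s in combinations(c, size))
                    and mean_score(c, class_images) >= min_support]
        size += 1
    return selected


# Toy unit scores for three feature units "a", "b", "c" (made-up numbers).
class_images = [{"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 0.7, "b": 0.9, "c": 0.0}]
other_images = [{"a": 0.8, "b": 0.1, "c": 0.0}, {"a": 0.9, "b": 0.0, "c": 0.2}]
mined = mine_grouplets(class_images, other_images, min_support=0.5, max_other=0.3)
```

In the toy data, unit "a" matches both classes, so it is not selected on its own, yet it still contributes to the discriminative 2-grouplet {a, b}.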


Using Grouplets for Classification

Discriminative grouplets: $\Lambda_1, \ldots, \Lambda_N$

$I \rightarrow [\nu(\Lambda_1, I), \ldots, \nu(\Lambda_N, I)] \rightarrow$ SVM
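As a rough sketch of this step, an image can be turned into a vector holding one matching score per mined grouplet. The grouplets, unit scores, and labels below are made up for illustration, and unit scores ν(λi, I) are assumed to be precomputed per image.

```python
def grouplet_features(unit_scores, grouplets):
    """Feature vector [nu(Lambda_1, I), ..., nu(Lambda_N, I)] for one image."""
    return [min(unit_scores[u] for u in g) for g in grouplets]


# Hypothetical mined grouplets over feature units "a" and "b".
grouplets = [("b",), ("a", "b")]

# Made-up unit scores for two images, with class labels +1 / -1.
images = [{"a": 0.9, "b": 0.8}, {"a": 0.8, "b": 0.1}]
labels = [1, -1]

# One row per image; these rows are what the classifier is trained on.
X = [grouplet_features(img, grouplets) for img in images]
```

The rows of `X` together with `labels` could then be passed to any SVM implementation, for example `sklearn.svm.SVC().fit(X, labels)` in scikit-learn. The slides do not specify the kernel or other SVM settings, so those remain open choices.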






People-Playing-Musical-Instruments (PPMI) Dataset
http://vision.stanford.edu/resources_links.html

PPMI+

PPMI-

# Image: (172) (164) (191) (148) (177) (133) (179) (149) (200) (188) (198) (169) (185) (167)

Original image / Normalized image (200 images each interaction)

Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset


Classification Detection

Playing saxophone

Playing bassoon

Playing saxophone

Playing French horn

Playing violin

vs.

Playing violin

Not playing violin

vs.

Playing different instruments

Playing vs. Not playing

For each interaction, 100 training and 100 testing images.

Classification: Playing Different Instruments

• 7-class classification on PPMI+ images

SPM: [Lazebnik et al, 2006]
DPM: [Felzenszwalb et al, 2008]
Constellation: [Fergus et al, 2003] [Niebles & Fei-Fei, 2007]

[Bar chart: classification accuracy (axis ticks 0.4 to 0.7) for Grouplet+SVM, SPM, DPM, Constellation, and BoW. Values printed on the chart: 59.9%, 54.9%, 39.0%, 37.7%, 65.7%.]

[Plot: number of mined Grouplets (y-axis, 0 to 1200) against Grouplet size (x-axis, 1 to 6)]

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: average PPMI+ images and average PPMI- images for Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin]

[Bar chart: accuracy per instrument for Grouplet+SVM, DPM, BoW, SPM]


Detecting people playing musical instruments

Procedure:

• Face detection with a low threshold;
• Crop and normalize image regions;
• 8-class classification
- 7 classes of playing instruments;
- Another class of not playing any instrument.

[Figure labels: Playing saxophone; No playing; No playing]



Playing saxophone

Playing bassoon

Playing French horn

Playing saxophone

Playing French horn

Area under the precision-recall curve:

• Our method: 45.7%;
• Spatial pyramid: 37.3%.



Playing French horn

False detection Missed detection

Area under the precision-recall curve:

• Our method: 45.7%;
• Spatial pyramid: 37.3%.


Examples of Mined Grouplets

Playing bassoon:

Playing saxophone:

Playing violin:

Playing guitar:


Conclusion

• Holistic image-based classification

Vs.

[B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]

[B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]

• Detailed understanding and reasoning

Pose estimation & object detection

The Next Talk

Playing saxophone

Playing bassoon

Playing saxophone

Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers.

And You
