Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions
Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
Human-Object Interaction
Playing saxophone
Human
Saxophone
Not playing saxophone
Robots interact with objects
Automatic sports commentary
“Kobe is dunking the ball.”
Medical care
Background: Human-Object Interaction
• Schneiderman & Kanade, 2000 • Viola & Jones, 2001 • Huang et al, 2007 • Papageorgiou & Poggio, 2000 • Wu & Nevatia, 2005 • Dalal & Triggs, 2005 • Mikolajczyk et al, 2005 • Leibe et al, 2005 • Bourdev & Malik, 2009 • Felzenszwalb & Huttenlocher, 2005 • Ren et al, 2005 • Ramanan, 2006 • Ferrari et al, 2008 • Yang & Mori, 2008 • Andriluka et al, 2009 • Eichner & Ferrari, 2009
• Lowe, 1999 • Belongie et al, 2002 • Fergus et al, 2003 • Fei-Fei et al, 2004 • Berg & Malik, 2005 • Felzenszwalb et al, 2005 • Grauman & Darrell, 2005 • Sivic et al, 2005 • Lazebnik et al, 2006 • Zhang et al, 2006 • Savarese et al, 2007 • Lampert et al, 2008 • Desai et al, 2009 • Gehler & Nowozin, 2009
• Murphy et al, 2003 • Hoiem et al, 2006 • Shotton et al, 2006
• Rabinovich et al, 2007 • Heitz & Koller, 2008 • Divvala et al, 2009
• Gupta et al, 2009
context
vs.
To be done
• Yao & Fei-Fei, 2010a • Yao & Fei-Fei, 2010b
Outline
• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion
Recognizing Human-Object Interaction is Challenging
Different background
Same object (saxophone), different interactions
Different pose (or viewpoint)
Different lighting
Different instrument, similar pose
Reference image: playing saxophone
Grouplet: our intuition
Bag-of-words:
• Thomas & Malik, 2001 • Csurka et al, 2004 • Fei-Fei & Perona, 2005 • Sivic et al, 2005
Spatial pyramid:
• Grauman & Darrell, 2005 • Lazebnik et al, 2006
Part-based:
• Weber et al, 2000 • Fergus et al, 2003 • Leibe et al, 2004 • Felzenszwalb et al, 2005 • Bourdev & Malik, 2009
Grouplet Representation:
• Part-based configuration
• Co-occurrence
• Discriminative
• Dense
Capture the subtle difference in human-object interactions.
Grouplet representation (e.g. 2-Grouplet)
Notations
• I: Image.
• P: Reference point in the image.
• Λ: Grouplet, e.g. Λ = {λ1, λ2}.
• λi: Feature unit, λi = {Ai, xi, σi}.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution (Gaussian distribution).
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.
• Δ: A small shift of the location.
Matching score between Λ and I:
\[ \nu(\Lambda, I) = \min_i \nu(\lambda_i, I) \]
Matching score between λi and I:
\[ \nu(\lambda_i, I) = \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i) \]
• p(Ai | a′): Codeword assignment score.
• N(x′ | xi, σi): Gaussian density value.
[Equation: a further form of the score of the shape min max p(· | ·) N(· | ·, ·); its subscripts are not recoverable from the slide text.]
matching score: 0.6
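The two matching scores defined above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes an isotropic 2-D Gaussian with variance σi, a circular neighborhood Ω(xi) of a hypothetical `radius`, and soft codeword assignments supplied as a dictionary per patch. It omits the small location shift Δ. All names and toy values are made up for the example.

```python
import math

def gaussian_density(x, mean, var):
    """Isotropic 2-D Gaussian density N(x | mean, var); isotropy is an assumption."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * var)) / (2.0 * math.pi * var)

def unit_score(unit, patches, radius):
    """nu(lambda_i, I): sum over patches x' in Omega(x_i) of p(A_i | a') * N(x' | x_i, sigma_i)."""
    codeword, location, var = unit
    total = 0.0
    for patch_location, assignment in patches:
        # Omega(x_i) is taken to be a disc of the given radius (an assumption).
        if math.dist(patch_location, location) <= radius:
            total += assignment.get(codeword, 0.0) * gaussian_density(patch_location, location, var)
    return total

def grouplet_score(grouplet, patches, radius):
    """nu(Lambda, I): the minimum of the feature-unit scores."""
    return min(unit_score(unit, patches, radius) for unit in grouplet)

# Toy image: each patch is (location x', {codeword: p(codeword | a')}).
patches = [((0.0, 0.0), {"A1": 1.0}), ((1.0, 0.0), {"A2": 0.5})]
unit_1 = ("A1", (0.0, 0.0), 1.0)  # (A_i, x_i, sigma_i)
unit_2 = ("A2", (0.0, 0.0), 1.0)
unit_3 = ("A3", (0.0, 0.0), 1.0)  # codeword that never occurs in the toy image

score_12 = grouplet_score([unit_1, unit_2], patches, radius=2.0)
score_13 = grouplet_score([unit_1, unit_3], patches, radius=2.0)
```

Because the grouplet score is a minimum, a single feature unit that is absent from the image (here `unit_3`) drives the whole score to zero, which is what makes the representation sensitive to co-occurrence.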
Grouplet representation
• Part-based configuration
• Co-occurrence
• Discriminative
matching score: 0.4 matching score: 0.0 matching score: 0.1
Playing saxophone Other interactions
Grouplet representation
• Part-based configuration
• Co-occurrence
• Discriminative
• Dense
All possible Codewords
Densely sample image locations
Many possible spatial distributions
1-grouplet 2-grouplet 3-grouplet
All possible combinations of feature units
A “Space” of Grouplets
Playing violin
Other interactions
Playing saxophone
Other interactions
On background
Shared by different interactions
We only need discriminative Grouplets
Playing violin: Large ν(Λ,I)
Other interactions: Small ν(Λ,I)
Playing saxophone: Large ν(Λ,I)
Other interactions: Small ν(Λ,I)
Number of Grouplets: 2^N (very large space)
Number of feature units: N. N is large (192200)
Obtaining discriminative grouplets for a class
Obtain grouplets with large ν(Λ,I) on the class.
Remove grouplets with large ν(Λ,I) from other classes.
Apriori Mining [Agrawal & Srikant, 1994]
Selected 1-grouplets
Candidate 2-grouplets
Number of Grouplets: 2^N (very large space)
Number of feature units: N. N is large (192200)
Mine 1000~2000 grouplets, only need to evaluate (2~100)×N grouplets
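A level-wise, Apriori-style search along these lines might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the support of a grouplet is taken to be its average matching score over the images of a class, feature-unit scores are given as one dictionary per image, and the thresholds `min_support` and `max_other_support` are hypothetical parameters. The pruning is valid because the grouplet score is a minimum over its feature units, so adding a unit can never raise the score.

```python
from itertools import combinations

def support(grouplet, images):
    """Average matching score of a grouplet over images (min over its feature units)."""
    return sum(min(img.get(u, 0.0) for u in grouplet) for img in images) / len(images)

def mine_grouplets(class_images, min_support, max_size=3):
    """Level-wise search: size-k candidates are built only from kept size-(k-1) grouplets."""
    units = sorted({u for img in class_images for u in img})
    level = [frozenset([u]) for u in units
             if support(frozenset([u]), class_images) >= min_support]
    selected = list(level)
    size = 1
    while level and size < max_size:
        size += 1
        kept = set(level)
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset of a candidate must itself have been kept.
        candidates = {c for c in candidates
                      if all(frozenset(s) in kept for s in combinations(c, size - 1))}
        level = [c for c in sorted(candidates, key=sorted)
                 if support(c, class_images) >= min_support]
        selected.extend(level)
    return selected

def remove_shared(grouplets, other_images, max_other_support):
    """Drop grouplets that also score highly on images of other classes."""
    return [g for g in grouplets if support(g, other_images) <= max_other_support]

# Toy data: per-image scores of feature units "a", "b", "c" (made-up values).
class_images = [{"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 0.7, "b": 0.9, "c": 0.0}]
other_images = [{"a": 0.9, "b": 0.1}]

mined = mine_grouplets(class_images, min_support=0.5)
discriminative = remove_shared(mined, other_images, max_other_support=0.3)
```

In the toy data, unit "c" is discarded at the first level, so no grouplet containing it is ever evaluated; this is how the search avoids enumerating all 2^N subsets.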
Using Grouplets for Classification
Discriminative grouplets: Λ1, …, ΛN
Image I → [ν(Λ1,I), …, ν(ΛN,I)] → SVM
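As a small illustration of the step above, the feature vector for one image can be assembled from the matching scores of the selected grouplets. The sketch assumes the per-unit scores ν(λ, I) have already been computed and are passed in as a dictionary; the function name, unit ids, and values are hypothetical. The resulting vectors would then be given to an SVM, which is not shown here.

```python
def grouplet_feature_vector(unit_scores, grouplets):
    """Return [nu(Lambda_1, I), ..., nu(Lambda_N, I)] for one image.

    unit_scores: dict mapping a feature-unit id to its score nu(lambda, I).
    grouplets:   list of grouplets, each an iterable of feature-unit ids.
    """
    return [min(unit_scores.get(u, 0.0) for u in g) for g in grouplets]

# Toy example with made-up unit ids and scores.
grouplets = [("a",), ("a", "b"), ("b", "c")]
image_scores = {"a": 0.6, "b": 0.4}
features = grouplet_feature_vector(image_scores, grouplets)
```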
People-Playing-Musical-Instruments (PPMI) Dataset
http://vision.stanford.edu/resources_links.html
PPMI+
PPMI-
# Image: (172) (164) (191) (148) (177) (133) (179) (149) (200) (188) (198) (169) (185) (167)
Original image
Normalized image (200 images each interaction)
Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset
Classification Detection
Playing saxophone
Playing bassoon
Playing saxophone
Playing French horn
Playing violin
vs.
Playing violin
Not playing violin
vs.
Playing different instruments
Playing vs. Not playing
For each interaction, 100 training and 100 testing images.
Classification: Playing Different Instruments
• 7-class classification on PPMI+ images
SPM: [Lazebnik et al, 2006]
DPM: [Felzenszwalb et al, 2008]
Constellation: [Fergus et al, 2003] [Niebles & Fei-Fei, 2007]
[Figure: bar chart of classification accuracy (axis 0.4 to 0.7) for Grouplet+SVM, SPM, DPM, Constellation, and BoW. Bar labels in extraction order: 59.9%, 54.9%, 39.0%, 37.7%, 65.7%.]
[Figure: number of mined grouplets (axis 0 to 1200) against grouplet size (1 to 6).]
Classifying Playing vs. Not playing
• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.
[Figure: average PPMI+ images and average PPMI- images for Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin; bar chart of accuracy per instrument for BoW, DPM, SPM, and Grouplet+SVM.]
Detecting people playing musical instruments
Procedure:
• Face detection with a low threshold;
• Crop and normalize image regions;
• 8-class classification:
- 7 classes of playing instruments;
- Another class of not playing any instrument.
Playing saxophone
Not playing
Not playing
Detecting people playing musical instruments
Playing saxophone
Playing bassoon
Playing French horn
Playing saxophone
Playing French horn
Area under the precision-recall curve:
• Our method: 45.7%;
• Spatial pyramid: 37.3%.
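The slides do not say how the area under the precision-recall curve was computed. One common way, shown here only as an assumed illustration, is non-interpolated average precision: rank detections by score and add the precision at each correct detection, weighted by the recall step it contributes. The function name and toy values are hypothetical.

```python
def pr_curve_area(scores, labels):
    """Area under the precision-recall curve as non-interpolated average precision.

    scores: detection confidences; labels: 1 for a correct detection, 0 otherwise.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    # Visit detections from the most to the least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_positives += 1
            # Precision at this rank times the recall increment 1 / total_positives.
            area += (true_positives / rank) / total_positives
    return area

# Toy example: four detections, two of them correct.
area = pr_curve_area([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```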
Playing French horn
False detection Missed detection
Examples of Mined Grouplets
Playing bassoon:
Playing saxophone:
Playing violin:
Playing guitar:
Conclusion
• Holistic image-based classification
[B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]
Vs.
• Detailed understanding and reasoning: pose estimation & object detection (the next talk)
[B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]
Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers.
And You