Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Presented by: Arwa Chittalwala, Irfan Shaikh, Heena Patel


Page 2

Human-Object Interaction

• Robots interact with objects

• Automatic sports commentary: "Kobe is dunking the ball."

• Medical care

Page 3

Human-Object Interaction

Holistic image based classification (previous talk: Grouplet)

[Figure: example images labeled "Playing saxophone", "Playing bassoon", "Playing saxophone"]

Grouplet is a generic feature for structured objects, or interactions of groups of objects.

Caltech101:

Method                     Result
Berg & Malik, 2005         48%
Grauman & Darrell, 2005    59%
Gehler & Nowozin, 2009     77%
OURS                       62%

Vs.

Detailed understanding and reasoning

HOI activity: Tennis Forehand

Page 4

Human-Object Interaction

Holistic image based classification vs. detailed understanding and reasoning

• Human pose estimation: Torso, Right-arm, Left-arm, Right-leg, Left-leg, Head

Page 5

Human-Object Interaction

Holistic image based classification vs. detailed understanding and reasoning

• Human pose estimation

• Object detection: Tennis racket

Page 6

Human-Object Interaction

Holistic image based classification vs. detailed understanding and reasoning

• Human pose estimation: Torso, Right-arm, Left-arm, Right-leg, Left-leg, Head

• Object detection: Tennis racket

HOI activity: Tennis Forehand

Page 7

Outline

• Background and Intuition

• Mutual Context of Object and Human Pose
    - Model Representation
    - Model Learning
    - Model Inference

• Experiments

• Conclusion

Page 9

Human pose estimation & Object detection

Human pose estimation is challenging:

• Difficult part appearance

• Self-occlusion

• Image region looks like a body part

Related work:

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Page 11

Human pose estimation & Object detection

Object detection facilitates human pose estimation, given that the object is detected.

Page 12

Human pose estimation & Object detection

Object detection is challenging:

• Small, low-resolution, partially occluded

• Image region similar to detection target

Related work:

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Page 14

Human pose estimation & Object detection

Human pose estimation facilitates object detection, given that the pose is estimated.

Page 15

Human pose estimation & Object detection

Mutual Context

Page 16

Context in Computer Vision

Previous work: use context cues to facilitate object detection.

• Hoiem et al, 2006 • Rabinovich et al, 2007 • Oliva & Torralba, 2007 • Heitz & Koller, 2008 • Desai et al, 2009 • Divvala et al, 2009

• Murphy et al, 2003 • Shotton et al, 2006 • Harzallah et al, 2009 • Li, Socher & Fei-Fei, 2009 • Marszalek et al, 2009 • Bao & Savarese, 2010

• Viola & Jones, 2001 • Lampert et al, 2008

Context is helpful, but it improves performance only moderately: ~3-4% (with context vs. without context).

Page 17

Context in Computer Vision

Previous work: use context cues to facilitate object detection.

• Hoiem et al, 2006 • Rabinovich et al, 2007 • Oliva & Torralba, 2007 • Heitz & Koller, 2008 • Desai et al, 2009 • Divvala et al, 2009

• Murphy et al, 2003 • Shotton et al, 2006 • Harzallah et al, 2009 • Li, Socher & Fei-Fei, 2009 • Marszalek et al, 2009 • Bao & Savarese, 2010

Context is helpful, but it improves performance only moderately: ~3-4% (with context vs. without context).

Our approach: two challenging tasks serve as mutual context of each other.

[Figure: results with mutual context vs. without context]

Page 18

Outline

• Background and Intuition

• Mutual Context of Object and Human Pose
    - Model Representation
    - Model Learning
    - Model Inference

• Experiments

• Conclusion

Page 19

Mutual Context Model Representation

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

• A: Activity (e.g., croquet shot, volleyball smash, tennis forehand).

• O: Object (e.g., croquet mallet, volleyball, tennis racket).

• H: Human pose.
    - More than one H for each A (intra-class variations);
    - Unobserved during training.

• P: Body parts P1, P2, ..., PN. lP: location; θP: orientation; sP: scale.

• f: Image evidence (fO, f1, f2, ..., fN). Shape context. [Belongie et al, 2002]

Page 20

Mutual Context Model Representation

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, where $\psi_e$ is the clique potential and $w_e$ is the clique weight.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
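
The co-occurrence potentials can be illustrated with a small counting sketch. It assumes each potential is simply the normalized frequency with which two labels appear together in annotated training samples. The slide only says "frequency of co-occurrence", so the normalization, the function name `cooccurrence_potentials`, and the `(activity, object, pose)` triple format are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def _normalize(counts, total):
    """Turn raw pair counts into relative frequencies."""
    return {pair: count / total for pair, count in counts.items()}


def cooccurrence_potentials(samples):
    """Estimate pairwise co-occurrence frequencies between A, O, and H.

    samples: iterable of (activity, obj, pose) label triples
             (hypothetical annotation format).
    Returns three dicts keyed by (A, O), (A, H), and (O, H) pairs.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    total = len(samples)
    psi_ao = Counter((a, o) for a, o, _ in samples)
    psi_ah = Counter((a, h) for a, _, h in samples)
    psi_oh = Counter((o, h) for _, o, h in samples)
    return (_normalize(psi_ao, total),
            _normalize(psi_ah, total),
            _normalize(psi_oh, total))
```

Pairs that never occur together are simply absent from the returned dictionaries; a caller would treat them as frequency zero.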

Page 21

Mutual Context Model Representation

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, where $\psi_e$ is the clique potential and $w_e$ is the clique weight.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$$\psi_e(O, P_n) = \underbrace{\mathrm{bin}(l_O - l_{P_n})}_{\text{location}} \cdot \underbrace{\mathrm{bin}(\theta_O - \theta_{P_n})}_{\text{orientation}} \cdot \underbrace{\mathcal{N}(s_O / s_{P_n})}_{\text{size}}$$
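
A rough sketch of how a spatial potential with location, orientation, and size terms might be evaluated is given below. It assumes the location and orientation terms are lookups into binned frequency tables over the relative offset and relative angle, and that the size term is a Gaussian over the scale ratio. The table layout, the `max_offset` range, and all function names are hypothetical choices made for this illustration.

```python
import math


def bin_index(value, lo, hi, n_bins):
    """Map value in [lo, hi) to one of n_bins equal-width bins (clamped)."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def spatial_potential(obj, part, loc_table, ori_table,
                      scale_mean, scale_std, max_offset):
    """Evaluate a location * orientation * size potential.

    obj, part: dicts with keys "x", "y", "theta", "s" (assumed layout).
    loc_table: square grid (list of rows) of frequencies over the
               relative offset (obj - part) in [-max_offset, max_offset).
    ori_table: list of frequencies over the relative angle in [0, 2*pi).
    """
    n_loc = len(loc_table)
    ix = bin_index(obj["x"] - part["x"], -max_offset, max_offset, n_loc)
    iy = bin_index(obj["y"] - part["y"], -max_offset, max_offset, n_loc)
    dtheta = (obj["theta"] - part["theta"]) % (2.0 * math.pi)
    it = bin_index(dtheta, 0.0, 2.0 * math.pi, len(ori_table))
    size_term = gaussian_pdf(obj["s"] / part["s"], scale_mean, scale_std)
    return loc_table[iy][ix] * ori_table[it] * size_term
```

In a learned model the two tables and the Gaussian parameters would be estimated from annotated training images; here they are plain arguments so the arithmetic is easy to follow.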

Page 22

Mutual Context Model Representation

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN; edges obtained by structure learning]

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, where $\psi_e$ is the clique potential and $w_e$ is the clique weight.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$$\psi_e(O, P_n) = \underbrace{\mathrm{bin}(l_O - l_{P_n})}_{\text{location}} \cdot \underbrace{\mathrm{bin}(\theta_O - \theta_{P_n})}_{\text{orientation}} \cdot \underbrace{\mathcal{N}(s_O / s_{P_n})}_{\text{size}}$$

• Learn structural connectivity among the body parts and the object.

Page 23

Mutual Context Model Representation

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, where $\psi_e$ is the clique potential and $w_e$ is the clique weight.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$$\psi_e(O, P_n) = \underbrace{\mathrm{bin}(l_O - l_{P_n})}_{\text{location}} \cdot \underbrace{\mathrm{bin}(\theta_O - \theta_{P_n})}_{\text{orientation}} \cdot \underbrace{\mathcal{N}(s_O / s_{P_n})}_{\text{size}}$$

• Learn structural connectivity among the body parts and the object.

• $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: Discriminative part detection scores. [Andriluka et al, 2009]
    - Shape context + AdaBoost [Belongie et al, 2002] [Viola & Jones, 2001]
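
To make the role of the clique weights concrete, the sketch below evaluates the weighted sum of clique potentials for one candidate configuration. The edge names and numeric values are hypothetical; in the model the potential values would come from co-occurrence statistics, spatial relationships, and detector scores.

```python
def model_score(potentials, weights):
    """Weighted sum of clique potentials: sum over e in E of w_e * psi_e.

    potentials: dict mapping an edge (pair of node names) to psi_e.
    weights:    dict mapping the same edges to w_e.
    """
    missing = set(potentials) - set(weights)
    if missing:
        raise KeyError(f"no weight for edges: {sorted(missing)}")
    return sum(weights[edge] * psi for edge, psi in potentials.items())


# Hypothetical potential values and weights for one configuration.
example_potentials = {
    ("A", "O"): 0.6,
    ("A", "H"): 0.3,
    ("O", "H"): 0.5,
    ("O", "f_O"): 1.2,
}
example_weights = {
    ("A", "O"): 1.0,
    ("A", "H"): 2.0,
    ("O", "H"): 1.0,
    ("O", "f_O"): 0.5,
}
example_score = model_score(example_potentials, example_weights)
```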

Page 24

Outline

• Background and Intuition

• Mutual Context of Object and Human Pose
    - Model Representation
    - Model Learning
    - Model Inference

• Experiments

• Conclusion

Page 28

Model Learning

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Model: $\sum_{e \in E} w_e \psi_e$

Input: example images of each activity (e.g., cricket shot, cricket bowling).

Goals:

• Hidden human poses (hidden variables)

• Structural connectivity (structure learning)

• Potential parameters and potential weights (parameter estimation)

Page 29

Model Learning

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Model: $\sum_{e \in E} w_e \psi_e$

Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

Approach: [Figure: croquet shot example]

Page 30

Model Learning

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

Approach: hill-climbing over the edge set E, adding an edge or removing an edge at each step.

$$\max_{E} \; \sum_{e \in E} w_e \psi_e \; - \; \frac{|E|^2}{2\sigma^2}$$

• First term: joint density of the model.

• Second term: Gaussian prior on the number of edges.
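
The hill-climbing search with "add an edge" and "remove an edge" moves can be sketched as follows. This is a simplified illustration: it assumes the model-fit term is a fixed per-edge gain (the hypothetical `edge_gain` table), whereas a real implementation would re-estimate the joint density of the model for each candidate structure. The penalty is a Gaussian prior on the number of edges with an assumed width `sigma`.

```python
from itertools import combinations


def structure_score(edges, edge_gain, sigma):
    """Model fit minus a Gaussian-prior penalty on the number of edges."""
    fit = sum(edge_gain[e] for e in edges)
    return fit - len(edges) ** 2 / (2.0 * sigma ** 2)


def hill_climb(nodes, edge_gain, sigma, max_steps=100):
    """Greedy structure search: toggle one edge per step while it helps."""
    candidates = [frozenset(pair) for pair in combinations(nodes, 2)]
    edges = set()
    best = structure_score(edges, edge_gain, sigma)
    for _ in range(max_steps):
        # Toggling an edge adds it if absent and removes it if present.
        moves = [edges ^ {e} for e in candidates]
        top = max(moves, key=lambda m: structure_score(m, edge_gain, sigma))
        top_score = structure_score(top, edge_gain, sigma)
        if top_score <= best:
            break  # local optimum: no single add/remove improves the score
        edges, best = top, top_score
    return edges, best
```

Because the penalty grows quadratically with the edge count, an edge is kept only when its gain outweighs the marginal penalty, which keeps the learned structure sparse.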

Page 31

Model Learning

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Model: $\sum_{e \in E} w_e \psi_e$

Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

Approach:

• Maximum likelihood: $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$, $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$

• Standard AdaBoost: $\psi_e(O, f_O)$, $\psi_e(P_n, f_{P_n})$

Page 32

Model Learning

[Figure: graphical model with nodes A, O, H, P1, P2, ..., PN and image evidence fO, f1, f2, ..., fN]

Model: $\sum_{e \in E} w_e \psi_e$

Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

Approach: max-margin learning.

$$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \sum_r \|\mathbf{w}_r\|_2^2 + \beta \sum_i \xi_i$$

$$\text{s.t.} \quad \forall i, r \text{ where } y(r) \neq y(c_i): \; \mathbf{w}_{c_i} \cdot \mathbf{x}_i - \mathbf{w}_r \cdot \mathbf{x}_i \geq 1 - \xi_i; \qquad \forall i: \; \xi_i \geq 0$$

Notations:

• xi: Potential values of the i-th image.

• wr: Potential weights of the r-th pose.

• y(r): Activity of the r-th pose.

• ξi: A slack variable for the i-th image.
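
The max-margin objective can be made concrete with a small evaluator. For fixed weights, the smallest feasible slack for an image is the largest margin violation against any pose that belongs to a different activity. The slack penalty weight `beta`, the list-based data layout, and the function names are assumptions for illustration; this sketch only evaluates the objective and does not solve the optimization.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(x, c, w, y):
    """Smallest feasible slack for one image.

    x: potential values of the image.
    c: index of the pose assigned to the image.
    w: list of weight vectors, one per pose.
    y: list giving the activity of each pose.
    """
    violations = [
        1.0 - (dot(w[c], x) - dot(w[r], x))
        for r in range(len(w))
        if y[r] != y[c]  # only poses of a different activity compete
    ]
    return max([0.0] + violations)


def objective(w, xs, cs, y, beta):
    """0.5 * sum_r ||w_r||^2 + beta * sum_i slack_i."""
    regularizer = 0.5 * sum(dot(w_r, w_r) for w_r in w)
    total_slack = sum(slack(x, c, w, y) for x, c in zip(xs, cs))
    return regularizer + beta * total_slack
```

Poses that share the image's activity are skipped in the constraint, so confusing two poses of the same activity is not penalized.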

Page 33

Learning Results

[Figure: learned models for Cricket defensive shot, Cricket bowling, and Croquet shot]

Page 34

Learning Results

[Figure: learned models for Tennis serve, Volleyball smash, and Tennis forehand]

Page 35

Outline

• Background and Intuition

• Mutual Context of Object and Human Pose
    - Model Representation
    - Model Learning
    - Model Inference

• Experiments

• Conclusion

Page 36

Model Inference

Image I and the learned models.

Page 37

Model Inference

Image I and the learned models.

Compositional Inference [Chen et al, 2007]:

• Head detection

• Torso detection

• Tennis racket detection

• Layout of the object and body parts.

Result for the first model: $\left(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n\right)$

Page 38

Model Inference

Image I and the learned models.

Results for each model: $\left(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n\right)$, ..., $\left(A_K, H_K, O_K^*, \{P_{K,n}^*\}_n\right)$

Output
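
One plausible reading of this step is sketched below: each of the K learned models is fitted to the image, and the output is taken from the best-scoring model. The selection by maximum score is an assumption, since the slide does not state it, and `infer_layout` is a hypothetical stand-in for the compositional inference step, not a real library call.

```python
def classify_image(image, models, infer_layout):
    """Fit every learned model to the image and keep the best one.

    models:       list of dicts, each with an "activity" entry
                  (one model per pose; hypothetical layout).
    infer_layout: callable (model, image) -> (score, layout), where
                  layout holds the inferred object and body-part
                  positions for that model.
    """
    if not models:
        raise ValueError("need at least one learned model")
    best = None
    for k, model in enumerate(models):
        score, layout = infer_layout(model, image)
        if best is None or score > best["score"]:
            best = {
                "activity": model["activity"],
                "pose_index": k,
                "layout": layout,
                "score": score,
            }
    return best
```

Under this reading, activity classification, pose estimation, and object detection all fall out of the same call: the activity and pose come from the winning model, and the object and part positions come from its layout.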

Page 39

Outline

• Background and Intuition

• Mutual Context of Object and Human Pose
    - Model Representation
    - Model Learning
    - Model Inference

• Experiments

• Conclusion

Page 40

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes; 180 training images (supervised with object and part locations) and 120 testing images.

Classes: Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash.

Tasks:

• Object detection;

• Pose estimation;

• Activity classification.

Page 42

Object Detection Results

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision, both from 0 to 1); panel titles: Cricket bat, Cricket ball, Croquet mallet, Tennis racket, Volleyball; annotation: Valid region]

Methods compared: Our Method, Sliding window, Pedestrian context.

[Andriluka et al, 2009] [Dalal & Triggs, 2006]

Page 43

Object Detection Results

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision, both from 0 to 1); panel titles include Volleyball and Cricket ball; legend: Our Method, Pedestrian as context, Scanning window detector]

[Figure: example detections comparing Sliding window, Pedestrian context, and Our method; rows: Small object, Background clutter]
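
For readers unfamiliar with precision-recall plots, the sketch below shows how such a curve is typically computed from scored detections. It assumes each detection has already been marked as a true or false positive against the ground truth; that matching step is outside the scope of this example, and the function name is illustrative.

```python
def precision_recall_curve(scores, is_true_positive, n_ground_truth):
    """Return (recall, precision) points, sweeping the score threshold.

    scores:           confidence of each detection.
    is_true_positive: parallel list of booleans.
    n_ground_truth:   number of annotated object instances.
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    false_pos = 0
    curve = []
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / n_ground_truth
        precision = true_pos / (true_pos + false_pos)
        curve.append((recall, precision))
    return curve
```

A detector whose curve stays closer to the top-right corner keeps high precision while recovering more of the annotated objects.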

Page 44

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes; 180 training and 120 testing images.

Classes: Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash.

Tasks:

• Object detection;

• Pose estimation;

• Activity classification.

Page 46

Human Pose Estimation Results

Method                   Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52     .22  .22     .21  .28     .24  .28     .17  .14     .42
Andriluka et al, 2009    .50     .31  .30     .31  .27     .18  .19     .11  .11     .45
Our full model           .66     .43  .39     .44  .34     .44  .40     .27  .29     .58

[Figure: Andriluka et al, 2009 vs. our estimation result, shown for the tennis serve model and the volleyball smash model]

Page 47

Human Pose Estimation Results

Method                   Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52     .22  .22     .21  .28     .24  .28     .17  .14     .42
Andriluka et al, 2009    .50     .31  .30     .31  .27     .18  .19     .11  .11     .45
Our full model           .66     .43  .39     .44  .34     .44  .40     .27  .29     .58
One pose per class       .63     .40  .36     .41  .31     .38  .35     .21  .23     .52

[Figure: four example estimation results]

Page 48

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes; 180 training and 120 testing images.

Classes: Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash.

Tasks:

• Object detection;

• Pose estimation;

• Activity classification.

Page 49

Activity Classification Results

Method                      Classification accuracy
Our model                   83.3%
Gupta et al, 2009           78.9%
Bag-of-words (SIFT+SVM)     52.5%

[Figure: bar chart of classification accuracy (axis from 0.5 to 0.9); annotations: "No scene information", "Scene is critical!!"; example images: Cricket shot, Tennis forehand]

Page 50

Conclusion

Human-Object Interaction: Grouplet representation vs. Mutual context model

Next Steps

• Pose estimation & Object detection on PPMI images.

• Modeling multiple objects and humans.
