CVPR 2010: Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities


Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu

Human-Object Interaction

• Robots interacting with objects
• Automatic sports commentary ("Kobe is dunking the ball.")
• Medical care

Human-Object Interaction

(Previous talk: Grouplet) Grouplet is a generic feature for structured objects, or interactions of groups of objects, e.g. playing saxophone vs. playing bassoon.

Holistic image-based classification vs. detailed understanding and reasoning.

[Caltech101 and an HOI activity (tennis forehand); reported classification accuracies: Berg & Malik, 2005: 48%; Grauman & Darrell, 2005: 59%; Gehler & Nowozin, 2009: 77%; ours: 62%.]

Human-Object Interaction

Beyond holistic image-based classification, detailed understanding and reasoning of an HOI activity (e.g. tennis forehand) requires:
• Human pose estimation (torso, head, ...)
• Object detection (tennis racket)

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Human pose estimation & Object detection

Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image regions that look like a body part

Prior work:
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Human pose estimation & Object detection

Pose estimation can be facilitated given that the object is detected.

Object detection is also challenging:
• Small, low-resolution, partially occluded objects
• Image regions similar to the detection target

Prior work:
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Likewise, object detection can be facilitated given that the pose is estimated.

Context in Computer Vision

Previous work – use context cues to facilitate object detection:
• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010

Helpful, but only moderately: detection with context outperforms detection without context by roughly 3-4%.

Our approach – two challenging tasks, object detection and human pose estimation, serve as mutual context of each other. [Example results with mutual context vs. without context.]

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Mutual Context Model Representation

• A: activity, e.g. croquet shot, volleyball smash, tennis forehand.
• O: object, e.g. croquet mallet, volleyball, tennis racket.
• H: human pose. More than one H for each A (intra-class variations); unobserved during training.
• P1, ..., PN: body parts. lP: location; θP: orientation; sP: scale.
• fO, f1, ..., fN: image evidence for the object and each body part. f: shape context [Belongie et al, 2002].
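To make the cast of variables concrete, here is a minimal sketch of how one image's annotation could be laid out; this is not code from the paper, and all class and field names are hypothetical.

```python
# A minimal sketch (hypothetical names) of the variables the model reasons about:
# A = activity, O = object, H = pose class, P_n = body parts with location l_P,
# orientation theta_P, and scale s_P; f would be shape-context descriptors.
from dataclasses import dataclass
from typing import List

@dataclass
class BodyPart:
    name: str           # e.g. "torso", "head"
    location: tuple     # l_P = (x, y) in image coordinates
    orientation: float  # theta_P in radians
    scale: float        # s_P relative to a reference size

@dataclass
class HOIInstance:
    activity: str          # A, e.g. "tennis_forehand"
    pose_class: int        # H, one of several poses per activity (hidden at training time)
    obj: BodyPart          # O, handled much like an extra "part" (e.g. "tennis_racket")
    parts: List[BodyPart]  # P_1 ... P_N

example = HOIInstance(
    activity="tennis_forehand",
    pose_class=0,
    obj=BodyPart("tennis_racket", (312.0, 140.0), 0.8, 1.1),
    parts=[BodyPart("torso", (250.0, 200.0), 0.0, 1.0),
           BodyPart("head", (255.0, 120.0), 0.1, 0.5)],
)
print(example.activity, len(example.parts), "parts")
```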

Mutual Context Model Representation

The model is a Markov Random Field over A, O, H, P1, ..., PN and the image evidence fO, f1, ..., fN:

  Ψ = Σ_{e ∈ E} w_e ψ_e

where each ψ_e is a clique potential and w_e is the corresponding clique weight.

• ψ_e(A,O), ψ_e(A,H), ψ_e(O,H): frequency of co-occurrence between A, O, and H.

Mutual Context Model Representation

• ψ_e(O,P_n), ψ_e(H,P_n), ψ_e(P_m,P_n): spatial relationship among the object and the body parts, e.g.

  ψ_e(O,P_n) = bin(l_O − l_{P_n}) · bin(θ_O − θ_{P_n}) · bin(s_O / s_{P_n})
  (relative location, orientation, and size)
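As an illustration of such binned spatial relations, a minimal sketch follows; the bin edges, dictionary layout, and function names are made up, since the slides do not specify the actual discretization.

```python
import math

def discretize(value, edges):
    """Return the index of the bin (defined by sorted edges) containing value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def spatial_relation_bins(obj, part):
    """Bin the relative location, orientation, and size of object O w.r.t. part P_n.

    obj and part are dicts with 'location' (x, y), 'orientation' (radians), 'scale'.
    Returns a tuple of bin indices; a lookup table learned from training data
    would then map this tuple to the potential value psi_e(O, P_n).
    """
    dx = obj["location"][0] - part["location"][0]
    dy = obj["location"][1] - part["location"][1]
    dtheta = (obj["orientation"] - part["orientation"]) % (2 * math.pi)
    size_ratio = obj["scale"] / part["scale"]

    loc_bin = (discretize(dx, [-100, -25, 25, 100]),
               discretize(dy, [-100, -25, 25, 100]))
    ori_bin = discretize(dtheta, [math.pi / 2, math.pi, 3 * math.pi / 2])
    size_bin = discretize(size_ratio, [0.5, 1.0, 2.0])
    return loc_bin, ori_bin, size_bin

racket = {"location": (312.0, 140.0), "orientation": 0.8, "scale": 1.1}
torso = {"location": (250.0, 200.0), "orientation": 0.0, "scale": 1.0}
print(spatial_relation_bins(racket, torso))
```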

Mutual Context Model Representation

• The structural connectivity among the body parts and the object is not fixed by hand; it is obtained by structure learning.

Mutual Context Model Representation

• ψ_e(O,f_O) and ψ_e(P_n,f_n): discriminative part detection scores, using shape context [Belongie et al, 2002] and AdaBoost [Viola & Jones, 2001], as in [Andriluka et al, 2009].
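With all three kinds of potentials in place, the model score is simply the weighted sum over the graph's edges. The sketch below shows that bookkeeping with toy numbers; the edge names and values are illustrative stand-ins, not the learned model.

```python
def model_score(edges, weights, potentials):
    """Compute Psi = sum_{e in E} w_e * psi_e for one image interpretation.

    edges:      list of edge identifiers, e.g. ("A", "O"), ("O", "P_torso"), ...
    weights:    dict mapping each edge to its clique weight w_e
    potentials: dict mapping each edge to its evaluated clique potential psi_e,
                i.e. a co-occurrence frequency, a binned spatial relation, or a
                discriminative detection score, depending on the edge type.
    """
    return sum(weights[e] * potentials[e] for e in edges)

# Toy example with made-up numbers:
edges = [("A", "O"), ("A", "H"), ("O", "H"), ("O", "P_torso"), ("P_torso", "f_torso")]
weights = {e: 1.0 for e in edges}
potentials = {("A", "O"): 0.9,             # co-occurrence of activity and object
              ("A", "H"): 0.7,             # co-occurrence of activity and pose
              ("O", "H"): 0.8,             # co-occurrence of object and pose
              ("O", "P_torso"): 0.4,       # binned spatial relation
              ("P_torso", "f_torso"): 1.3} # part detection score
print(model_score(edges, weights, potentials))
```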

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Model Learning

Input: training images with their activity labels (e.g. cricket shot, cricket bowling).

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights

Model Learning

The goals correspond to three learning problems:
• Hidden variables – the hidden human poses
• Structure learning – the structural connectivity
• Parameter estimation – the potential parameters and potential weights

Model Learning

Approach (illustrated on the croquet shot class):

• Structure learning: search for the connectivity E that maximizes

  Σ_{e ∈ E} w_e ψ_e  −  |E|² / (2σ²)

  i.e. the joint density of the model combined with a Gaussian prior on the edge number, solved by hill-climbing.
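A sketch of the hill-climbing search described above, under the assumption that one edge is toggled per step and that the objective is model fit minus the edge-count penalty; the fit function below is a toy stand-in, not the learned joint density.

```python
import itertools

def hill_climb_structure(candidate_edges, score_fn, max_iters=100):
    """Greedy hill-climbing over edge sets E.

    candidate_edges: all edges that could be added to the model.
    score_fn(E):     returns model_fit(E) - |E|**2 / (2 * sigma**2), i.e. the
                     objective with the Gaussian prior on the number of edges.
    """
    current = set()
    current_score = score_fn(current)
    for _ in range(max_iters):
        best_move, best_score = None, current_score
        for e in candidate_edges:
            neighbor = current ^ {e}          # toggle one edge (add or remove it)
            s = score_fn(neighbor)
            if s > best_score:
                best_move, best_score = neighbor, s
        if best_move is None:                 # local maximum reached
            break
        current, current_score = best_move, best_score
    return current, current_score

# Toy example: prefer edges touching the object node "O", penalize the edge count.
sigma = 2.0
all_edges = list(itertools.combinations(["O", "P1", "P2", "P3", "H"], 2))
fit = lambda E: sum(1.0 for (u, v) in E if "O" in (u, v)) + 0.2 * len(E)
score = lambda E: fit(E) - len(E) ** 2 / (2 * sigma ** 2)
print(hill_climb_structure(all_edges, score))
```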

Model Learning

• Potential parameters:
  – ψ_e(A,O), ψ_e(A,H), ψ_e(O,H), ψ_e(O,P_n), ψ_e(H,P_n), ψ_e(P_m,P_n): maximum likelihood estimation.
  – ψ_e(O,f_O), ψ_e(P_n,f_n): standard AdaBoost.
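For the co-occurrence potentials, maximum likelihood essentially reduces to counting how often labels appear together in the training set and normalizing. A minimal sketch follows, with made-up training pairs; the additive smoothing is an assumption of the sketch, not something stated on the slides.

```python
from collections import Counter

def cooccurrence_potential(pairs, smoothing=1.0):
    """Maximum-likelihood style estimate of psi_e(X, Y) from observed (x, y) pairs.

    Returns a dict mapping (x, y) -> relative frequency (with additive smoothing),
    usable directly as a co-occurrence clique potential.
    """
    counts = Counter(pairs)
    keys = set(counts)
    total = sum(counts.values()) + smoothing * len(keys)
    return {k: (counts[k] + smoothing) / total for k in keys}

# Toy training annotations: (activity A, object O)
train = [("tennis_forehand", "tennis_racket"),
         ("tennis_forehand", "tennis_racket"),
         ("croquet_shot", "croquet_mallet"),
         ("volleyball_smash", "volleyball")]
print(cooccurrence_potential(train))
```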

Model Learning

• Potential weights: max-margin learning.

Notation:
• x_i: potential values of the i-th image.
• w_r: potential weights of the r-th pose.
• y(r): activity of the r-th pose.
• ξ_i: a slack variable for the i-th image.

  min_{w, ξ}  ½ Σ_r ‖w_r‖² + C Σ_i ξ_i
  s.t.  w_{r_i}·x_i − w_c·x_i ≥ 1 − ξ_i  for all c with y(c) ≠ y(r_i),  and  ξ_i ≥ 0.
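A rough sketch of this max-margin objective, optimized here with plain subgradient descent instead of a QP solver; this is a simplification, and the toy data, learning rate, and the way the most-violating pose is picked are illustrative assumptions.

```python
import numpy as np

def train_max_margin(X, pose_ids, activity_of, n_poses, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on
        1/2 * sum_r ||w_r||^2 + C * sum_i xi_i,
    where xi_i = max(0, 1 - (w_{r_i} - w_c) . x_i) over poses c whose activity
    differs from that of the i-th image's pose r_i.
    X: (n_images, d) potential values; pose_ids: pose index r_i per image;
    activity_of: activity label of each pose index.
    """
    n, d = X.shape
    W = np.zeros((n_poses, d))
    for _ in range(epochs):
        grad = W.copy()                                  # gradient of the 1/2 ||W||^2 term
        for i in range(n):
            r = pose_ids[i]
            rivals = [c for c in range(n_poses) if activity_of[c] != activity_of[r]]
            if not rivals:
                continue
            c = max(rivals, key=lambda c: W[c] @ X[i])   # most violating pose
            if 1 - (W[r] - W[c]) @ X[i] > 0:             # hinge constraint violated
                grad[r] -= C * X[i]
                grad[c] += C * X[i]
        W -= lr * grad
    return W

# Toy data: 2 activities, one pose each, 2-dimensional potential vectors.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
pose_ids = [0, 0, 1, 1]
activity_of = ["tennis", "croquet"]
W = train_max_margin(X, pose_ids, activity_of, n_poses=2)
print(np.argmax(W @ X.T, axis=0))   # predicted pose per image
```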

Learning Results

[Visualizations of the learned models for each activity: cricket defensive shot, cricket bowling, croquet shot, tennis serve, tennis forehand, volleyball smash.]

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Model Inference

Given a test image I and the K learned models:
• Run the part and object detectors (e.g. head, torso, and tennis racket detections).
• Use compositional inference [Chen et al, 2007] to estimate the layout of the object and the body parts under each model.
• Each model k yields a candidate interpretation (A_k, H_k, O_k*, {P_k,n*}), for k = 1, ..., K; together these form the output.
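At a high level, this step can be viewed as letting each learned pose model propose its best layout and then comparing the proposals by their model scores. In the sketch below, fit_layout is a stand-in for the compositional inference step, which the slides do not spell out; all names are hypothetical.

```python
def infer(image, learned_models, fit_layout):
    """Run each of the K learned pose models on the image and collect
    (activity A_k, pose H_k, best object O_k*, best body parts {P_k,n*}, score).

    fit_layout(image, model) stands in for compositional inference [Chen et al, 2007]:
    it returns the best-scoring object / body-part layout under that model
    together with its score Psi.
    """
    candidates = []
    for model in learned_models:
        layout, score = fit_layout(image, model)
        candidates.append({"activity": model["activity"],
                           "pose": model["pose_id"],
                           "layout": layout,
                           "score": score})
    # One natural readout: the highest-scoring candidate gives the final label and layout.
    return max(candidates, key=lambda c: c["score"]), candidates

# Toy usage with a dummy fit_layout:
models = [{"activity": "tennis_serve", "pose_id": 0},
          {"activity": "croquet_shot", "pose_id": 1}]
dummy_fit = lambda img, m: ({"object": None, "parts": []}, 1.0 if m["pose_id"] == 0 else 0.4)
best, _ = infer("image.jpg", models, dummy_fit)
print(best["activity"])
```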

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes – cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash.

180 training images (supervised with object and part locations) & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Object Detection Results

[Precision-recall curves (recall on the x-axis, precision on the y-axis, with the valid region marked) for detecting the cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball. Methods compared: our method, a scanning/sliding-window detector [Dalal & Triggs, 2006], and pedestrian context [Andriluka et al, 2009]. Annotated challenges: small objects and background clutter.]
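For reference, precision-recall curves of this kind are typically computed by sweeping down ranked detection scores; the sketch below is standard evaluation machinery rather than anything specific to this paper, and the matching criterion (e.g. falling inside the valid region with enough overlap) is left to the caller.

```python
def precision_recall(scores, is_true_positive, n_ground_truth):
    """Compute a precision-recall curve from ranked detections.

    scores:           confidence score of each detection.
    is_true_positive: whether each detection matches an unmatched ground-truth object.
    n_ground_truth:   total number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    curve = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / n_ground_truth, tp / (tp + fp)))  # (recall, precision)
    return curve

print(precision_recall([0.9, 0.8, 0.6, 0.3], [True, False, True, True], n_ground_truth=4))
```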

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes, 180 training & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Human Pose Estimation Results

(Each limb column reports two values.)

Method                  Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52    .22   .22    .21   .28    .24   .28    .17   .14    .42
Andriluka et al, 2009    .50    .31   .30    .31   .27    .18   .19    .11   .11    .45
Our full model           .66    .43   .39    .44   .34    .44   .40    .27   .29    .58

Human Pose Estimation Results

(Table as above.)

[Qualitative comparison of pose estimates: Andriluka et al, 2009 vs. our estimation result, shown for the tennis serve model and the volleyball smash model.]

Human Pose Estimation Results

Method                  Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52    .22   .22    .21   .28    .24   .28    .17   .14    .42
Andriluka et al, 2009    .50    .31   .30    .31   .27    .18   .19    .11   .11    .45
Our full model           .66    .43   .39    .44   .34    .44   .40    .27   .29    .58
One pose per class       .63    .40   .36    .41   .31    .38   .35    .21   .23    .52

[Example estimation results.]

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes, 180 training & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Activity Classification Results

Classification accuracy:
• Bag-of-words (SIFT+SVM): 52.5%
• Gupta et al, 2009: 78.9%
• Our model: 83.3%

No scene information is used in our model; scene is critical (e.g. for cricket shot vs. tennis forehand).

Conclusion

Human-Object Interaction: Grouplet representation vs. Mutual context model.

Next Steps
• Pose estimation & object detection on PPMI images.
• Modeling multiple objects and humans.

Acknowledgment
• Stanford Vision Lab reviewers:
  – Barry Chai (1985-2010)
  – Juan Carlos Niebles
  – Hao Su
• Silvio Savarese, U. Michigan
• Anonymous reviewers
