CVPR 2010: Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities


Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu

Human-Object Interaction

• Robots interacting with objects
• Automatic sports commentary ("Kobe is dunking the ball.")
• Medical care

Human-Object Interaction

(Previous talk: Grouplet) Grouplet is a generic feature for structured objects, or interactions of groups of objects, e.g. playing saxophone vs. playing bassoon.

Holistic image-based classification vs. detailed understanding and reasoning.

[Caltech101 and an HOI activity (tennis forehand); reported classification accuracies: Berg & Malik, 2005: 48%; Grauman & Darrell, 2005: 59%; Gehler & Nowozin, 2009: 77%; ours: 62%.]

Human-Object Interaction

Beyond holistic image-based classification, detailed understanding and reasoning of an HOI activity (e.g. tennis forehand) requires:
• Human pose estimation (torso, head, ...)
• Object detection (tennis racket)

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Human pose estimation & Object detection

Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image regions that look like a body part

Prior work:
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Human pose estimation & Object detection

Pose estimation can be facilitated given that the object is detected.

Object detection is also challenging:
• Small, low-resolution, partially occluded objects
• Image regions similar to the detection target

Prior work:
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Likewise, object detection can be facilitated given that the pose is estimated.

Context in Computer Vision

Previous work – use context cues to facilitate object detection:
• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010

Helpful, but only moderately: detection with context outperforms detection without context by roughly 3-4%.

Our approach – two challenging tasks, object detection and human pose estimation, serve as mutual context of each other. [Example results with mutual context vs. without context.]

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Mutual Context Model Representation

• A: activity, e.g. croquet shot, volleyball smash, tennis forehand.
• O: object, e.g. croquet mallet, volleyball, tennis racket.
• H: human pose. More than one H for each A (intra-class variations); unobserved during training.
• P1, ..., PN: body parts. lP: location; θP: orientation; sP: scale.
• fO, f1, ..., fN: image evidence for the object and each body part. f: shape context [Belongie et al, 2002].
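To make the cast of variables concrete, here is a minimal sketch of how one image's annotation could be laid out; this is not code from the paper, and all class and field names are hypothetical.

```python
# A minimal sketch (hypothetical names) of the variables the model reasons about:
# A = activity, O = object, H = pose class, P_n = body parts with location l_P,
# orientation theta_P, and scale s_P; f would be shape-context descriptors.
from dataclasses import dataclass
from typing import List

@dataclass
class BodyPart:
    name: str           # e.g. "torso", "head"
    location: tuple     # l_P = (x, y) in image coordinates
    orientation: float  # theta_P in radians
    scale: float        # s_P relative to a reference size

@dataclass
class HOIInstance:
    activity: str          # A, e.g. "tennis_forehand"
    pose_class: int        # H, one of several poses per activity (hidden at training time)
    obj: BodyPart          # O, handled much like an extra "part" (e.g. "tennis_racket")
    parts: List[BodyPart]  # P_1 ... P_N

example = HOIInstance(
    activity="tennis_forehand",
    pose_class=0,
    obj=BodyPart("tennis_racket", (312.0, 140.0), 0.8, 1.1),
    parts=[BodyPart("torso", (250.0, 200.0), 0.0, 1.0),
           BodyPart("head", (255.0, 120.0), 0.1, 0.5)],
)
print(example.activity, len(example.parts), "parts")
```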

Mutual Context Model Representation

The model is a Markov Random Field over A, O, H, P1, ..., PN and the image evidence fO, f1, ..., fN:

  Ψ = Σ_{e ∈ E} w_e ψ_e

where each ψ_e is a clique potential and w_e is the corresponding clique weight.

• ψ_e(A,O), ψ_e(A,H), ψ_e(O,H): frequency of co-occurrence between A, O, and H.

Mutual Context Model Representation

• ψ_e(O,P_n), ψ_e(H,P_n), ψ_e(P_m,P_n): spatial relationship among the object and the body parts, e.g.

  ψ_e(O,P_n) = bin(l_O − l_{P_n}) · bin(θ_O − θ_{P_n}) · bin(s_O / s_{P_n})
  (relative location, orientation, and size)
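As an illustration of such binned spatial relations, a minimal sketch follows; the bin edges, dictionary layout, and function names are made up, since the slides do not specify the actual discretization.

```python
import math

def discretize(value, edges):
    """Return the index of the bin (defined by sorted edges) containing value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def spatial_relation_bins(obj, part):
    """Bin the relative location, orientation, and size of object O w.r.t. part P_n.

    obj and part are dicts with 'location' (x, y), 'orientation' (radians), 'scale'.
    Returns a tuple of bin indices; a lookup table learned from training data
    would then map this tuple to the potential value psi_e(O, P_n).
    """
    dx = obj["location"][0] - part["location"][0]
    dy = obj["location"][1] - part["location"][1]
    dtheta = (obj["orientation"] - part["orientation"]) % (2 * math.pi)
    size_ratio = obj["scale"] / part["scale"]

    loc_bin = (discretize(dx, [-100, -25, 25, 100]),
               discretize(dy, [-100, -25, 25, 100]))
    ori_bin = discretize(dtheta, [math.pi / 2, math.pi, 3 * math.pi / 2])
    size_bin = discretize(size_ratio, [0.5, 1.0, 2.0])
    return loc_bin, ori_bin, size_bin

racket = {"location": (312.0, 140.0), "orientation": 0.8, "scale": 1.1}
torso = {"location": (250.0, 200.0), "orientation": 0.0, "scale": 1.0}
print(spatial_relation_bins(racket, torso))
```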

Mutual Context Model Representation

• The structural connectivity among the body parts and the object is not fixed by hand; it is obtained by structure learning.

Mutual Context Model Representation

• ψ_e(O,f_O) and ψ_e(P_n,f_n): discriminative part detection scores, using shape context [Belongie et al, 2002] and AdaBoost [Viola & Jones, 2001], as in [Andriluka et al, 2009].
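With all three kinds of potentials in place, the model score is simply the weighted sum over the graph's edges. The sketch below shows that bookkeeping with toy numbers; the edge names and values are illustrative stand-ins, not the learned model.

```python
def model_score(edges, weights, potentials):
    """Compute Psi = sum_{e in E} w_e * psi_e for one image interpretation.

    edges:      list of edge identifiers, e.g. ("A", "O"), ("O", "P_torso"), ...
    weights:    dict mapping each edge to its clique weight w_e
    potentials: dict mapping each edge to its evaluated clique potential psi_e,
                i.e. a co-occurrence frequency, a binned spatial relation, or a
                discriminative detection score, depending on the edge type.
    """
    return sum(weights[e] * potentials[e] for e in edges)

# Toy example with made-up numbers:
edges = [("A", "O"), ("A", "H"), ("O", "H"), ("O", "P_torso"), ("P_torso", "f_torso")]
weights = {e: 1.0 for e in edges}
potentials = {("A", "O"): 0.9,             # co-occurrence of activity and object
              ("A", "H"): 0.7,             # co-occurrence of activity and pose
              ("O", "H"): 0.8,             # co-occurrence of object and pose
              ("O", "P_torso"): 0.4,       # binned spatial relation
              ("P_torso", "f_torso"): 1.3} # part detection score
print(model_score(edges, weights, potentials))
```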

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Model Learning

Input: training images with their activity labels (e.g. cricket shot, cricket bowling).

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights

Model Learning

The goals correspond to three learning problems:
• Hidden variables – the hidden human poses
• Structure learning – the structural connectivity
• Parameter estimation – the potential parameters and potential weights

Model Learning

Approach (illustrated on the croquet shot class):

• Structure learning: search for the connectivity E that maximizes

  Σ_{e ∈ E} w_e ψ_e  −  |E|² / (2σ²)

  i.e. the joint density of the model combined with a Gaussian prior on the edge number, solved by hill-climbing.
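A sketch of the hill-climbing search described above, under the assumption that one edge is toggled per step and that the objective is model fit minus the edge-count penalty; the fit function below is a toy stand-in, not the learned joint density.

```python
import itertools

def hill_climb_structure(candidate_edges, score_fn, max_iters=100):
    """Greedy hill-climbing over edge sets E.

    candidate_edges: all edges that could be added to the model.
    score_fn(E):     returns model_fit(E) - |E|**2 / (2 * sigma**2), i.e. the
                     objective with the Gaussian prior on the number of edges.
    """
    current = set()
    current_score = score_fn(current)
    for _ in range(max_iters):
        best_move, best_score = None, current_score
        for e in candidate_edges:
            neighbor = current ^ {e}          # toggle one edge (add or remove it)
            s = score_fn(neighbor)
            if s > best_score:
                best_move, best_score = neighbor, s
        if best_move is None:                 # local maximum reached
            break
        current, current_score = best_move, best_score
    return current, current_score

# Toy example: prefer edges touching the object node "O", penalize the edge count.
sigma = 2.0
all_edges = list(itertools.combinations(["O", "P1", "P2", "P3", "H"], 2))
fit = lambda E: sum(1.0 for (u, v) in E if "O" in (u, v)) + 0.2 * len(E)
score = lambda E: fit(E) - len(E) ** 2 / (2 * sigma ** 2)
print(hill_climb_structure(all_edges, score))
```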

Model Learning

• Potential parameters:
  – ψ_e(A,O), ψ_e(A,H), ψ_e(O,H), ψ_e(O,P_n), ψ_e(H,P_n), ψ_e(P_m,P_n): maximum likelihood estimation.
  – ψ_e(O,f_O), ψ_e(P_n,f_n): standard AdaBoost.
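For the co-occurrence potentials, maximum likelihood essentially reduces to counting how often labels appear together in the training set and normalizing. A minimal sketch follows, with made-up training pairs; the additive smoothing is an assumption of the sketch, not something stated on the slides.

```python
from collections import Counter

def cooccurrence_potential(pairs, smoothing=1.0):
    """Maximum-likelihood style estimate of psi_e(X, Y) from observed (x, y) pairs.

    Returns a dict mapping (x, y) -> relative frequency (with additive smoothing),
    usable directly as a co-occurrence clique potential.
    """
    counts = Counter(pairs)
    keys = set(counts)
    total = sum(counts.values()) + smoothing * len(keys)
    return {k: (counts[k] + smoothing) / total for k in keys}

# Toy training annotations: (activity A, object O)
train = [("tennis_forehand", "tennis_racket"),
         ("tennis_forehand", "tennis_racket"),
         ("croquet_shot", "croquet_mallet"),
         ("volleyball_smash", "volleyball")]
print(cooccurrence_potential(train))
```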

Model Learning

• Potential weights: max-margin learning.

Notation:
• x_i: potential values of the i-th image.
• w_r: potential weights of the r-th pose.
• y(r): activity of the r-th pose.
• ξ_i: a slack variable for the i-th image.

  min_{w, ξ}  ½ Σ_r ‖w_r‖² + C Σ_i ξ_i
  s.t.  w_{r_i}·x_i − w_c·x_i ≥ 1 − ξ_i  for all c with y(c) ≠ y(r_i),  and  ξ_i ≥ 0.
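A rough sketch of this max-margin objective, optimized here with plain subgradient descent instead of a QP solver; this is a simplification, and the toy data, learning rate, and the way the most-violating pose is picked are illustrative assumptions.

```python
import numpy as np

def train_max_margin(X, pose_ids, activity_of, n_poses, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on
        1/2 * sum_r ||w_r||^2 + C * sum_i xi_i,
    where xi_i = max(0, 1 - (w_{r_i} - w_c) . x_i) over poses c whose activity
    differs from that of the i-th image's pose r_i.
    X: (n_images, d) potential values; pose_ids: pose index r_i per image;
    activity_of: activity label of each pose index.
    """
    n, d = X.shape
    W = np.zeros((n_poses, d))
    for _ in range(epochs):
        grad = W.copy()                                  # gradient of the 1/2 ||W||^2 term
        for i in range(n):
            r = pose_ids[i]
            rivals = [c for c in range(n_poses) if activity_of[c] != activity_of[r]]
            if not rivals:
                continue
            c = max(rivals, key=lambda c: W[c] @ X[i])   # most violating pose
            if 1 - (W[r] - W[c]) @ X[i] > 0:             # hinge constraint violated
                grad[r] -= C * X[i]
                grad[c] += C * X[i]
        W -= lr * grad
    return W

# Toy data: 2 activities, one pose each, 2-dimensional potential vectors.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
pose_ids = [0, 0, 1, 1]
activity_of = ["tennis", "croquet"]
W = train_max_margin(X, pose_ids, activity_of, n_poses=2)
print(np.argmax(W @ X.T, axis=0))   # predicted pose per image
```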

Learning Results

[Visualizations of the learned models for each activity: cricket defensive shot, cricket bowling, croquet shot, tennis serve, tennis forehand, volleyball smash.]

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Model Inference

Given a test image I and the K learned models:
• Run the part and object detectors (e.g. head, torso, and tennis racket detections).
• Use compositional inference [Chen et al, 2007] to estimate the layout of the object and the body parts under each model.
• Each model k yields a candidate interpretation (A_k, H_k, O_k*, {P_k,n*}), for k = 1, ..., K; together these form the output.
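At a high level, this step can be viewed as letting each learned pose model propose its best layout and then comparing the proposals by their model scores. In the sketch below, fit_layout is a stand-in for the compositional inference step, which the slides do not spell out; all names are hypothetical.

```python
def infer(image, learned_models, fit_layout):
    """Run each of the K learned pose models on the image and collect
    (activity A_k, pose H_k, best object O_k*, best body parts {P_k,n*}, score).

    fit_layout(image, model) stands in for compositional inference [Chen et al, 2007]:
    it returns the best-scoring object / body-part layout under that model
    together with its score Psi.
    """
    candidates = []
    for model in learned_models:
        layout, score = fit_layout(image, model)
        candidates.append({"activity": model["activity"],
                           "pose": model["pose_id"],
                           "layout": layout,
                           "score": score})
    # One natural readout: the highest-scoring candidate gives the final label and layout.
    return max(candidates, key=lambda c: c["score"]), candidates

# Toy usage with a dummy fit_layout:
models = [{"activity": "tennis_serve", "pose_id": 0},
          {"activity": "croquet_shot", "pose_id": 1}]
dummy_fit = lambda img, m: ({"object": None, "parts": []}, 1.0 if m["pose_id"] == 0 else 0.4)
best, _ = infer("image.jpg", models, dummy_fit)
print(best["activity"])
```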

Outline

• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes – cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash.

180 training images (supervised with object and part locations) & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Object Detection Results

[Precision-recall curves (recall on the x-axis, precision on the y-axis, with the valid region marked) for detecting the cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball. Methods compared: our method, a scanning/sliding-window detector [Dalal & Triggs, 2006], and pedestrian context [Andriluka et al, 2009]. Annotated challenges: small objects and background clutter.]
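For reference, precision-recall curves of this kind are typically computed by sweeping down ranked detection scores; the sketch below is standard evaluation machinery rather than anything specific to this paper, and the matching criterion (e.g. falling inside the valid region with enough overlap) is left to the caller.

```python
def precision_recall(scores, is_true_positive, n_ground_truth):
    """Compute a precision-recall curve from ranked detections.

    scores:           confidence score of each detection.
    is_true_positive: whether each detection matches an unmatched ground-truth object.
    n_ground_truth:   total number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    curve = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / n_ground_truth, tp / (tp + fp)))  # (recall, precision)
    return curve

print(precision_recall([0.9, 0.8, 0.6, 0.3], [True, False, True, True], n_ground_truth=4))
```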

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes, 180 training & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Human Pose Estimation Results

(Each limb column reports two values.)

Method                  Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52    .22   .22    .21   .28    .24   .28    .17   .14    .42
Andriluka et al, 2009    .50    .31   .30    .31   .27    .18   .19    .11   .11    .45
Our full model           .66    .43   .39    .44   .34    .44   .40    .27   .29    .58

Human Pose Estimation Results

(Table as above.)

[Qualitative comparison of pose estimates: Andriluka et al, 2009 vs. our estimation result, shown for the tennis serve model and the volleyball smash model.]

Human Pose Estimation Results

Method                  Torso   Upper Leg    Lower Leg    Upper Arm    Lower Arm    Head
Ramanan, 2006            .52    .22   .22    .21   .28    .24   .28    .17   .14    .42
Andriluka et al, 2009    .50    .31   .30    .31   .27    .18   .19    .11   .11    .45
Our full model           .66    .43   .39    .44   .34    .44   .40    .27   .29    .58
One pose per class       .63    .40   .36    .41   .31    .38   .35    .21   .23    .52

[Example estimation results.]

Dataset and Experiment Setup

Sport data set [Gupta et al, 2009]: 6 classes, 180 training & 120 testing images.

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Activity Classification Results

Classification accuracy:
• Bag-of-words (SIFT+SVM): 52.5%
• Gupta et al, 2009: 78.9%
• Our model: 83.3%

No scene information is used in our model; scene is critical (e.g. for cricket shot vs. tennis forehand).

Conclusion

Human-Object Interaction: Grouplet representation vs. Mutual context model.

Next Steps
• Pose estimation & object detection on PPMI images.
• Modeling multiple objects and humans.

Acknowledgment
• Stanford Vision Lab reviewers:
  – Barry Chai (1985-2010)
  – Juan Carlos Niebles
  – Hao Su
• Silvio Savarese, U. Michigan
• Anonymous reviewers
