Learning the Appearance and Motion of People in Video
Hedvig Sidenbladh, KTH, [email protected], www.nada.kth.se/~hedvig/
Michael Black, Brown University, [email protected], www.cs.brown.edu/people/black/
Collaborators
• David Fleet, Xerox PARC [email protected]
• Dirk Ormoneit, Stanford University [email protected]
• Jan-Olof Eklundh, KTH [email protected]
Goal
• Tracking and reconstruction of human motion in 3D
– Articulated 3D model
– Monocular sequence
– Pinhole camera model
– Unknown, cluttered environment
Why is it Important?
• Human-machine interaction
– Robots
– Intelligent rooms
• Video search
• Animation, motion capture
• Surveillance
Why is it Hard?
• Structure is unobservable - inference from visible parts
• People appear in many ways - how to find a model that fits them all?
Why is it Hard?
• People move fast and non-linearly
• 3D to 2D projection ambiguities
• Large occlusion
• Similar appearance of different limbs
• Large search space
Extreme case
Bayesian Inference
Exploit cues in the images. Learn likelihood models: p(image cue | model)
Build models of human form and motion. Learn priors over model parameters: p(model)
Represent the posterior distribution: p(model | cue) ∝ p(cue | model) p(model)
Human Model
• Limbs = truncated cones in 3D
• Pose determined by parameters
State of the Art: Bregler and Malik ‘98
• Brightness constancy cue – Insensitive to appearance
• Full-body required multiple cameras
• Single hypothesis
Brightness Constancy
I(x, t+1) = I(x + u, t) + η
Image motion of foreground as a function of the 3D motion of the body.
Problem: no fixed model of appearance (drift).
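As a toy illustration of this cue, the residual η can be summed over pixels for candidate displacements; the 1-D signal and integer shift below are illustrative assumptions, not the method's actual image model:

```python
# A minimal 1-D sketch of the brightness constancy assumption
# I(x, t+1) = I(x + u, t) + eta: for the correct displacement u the
# residual eta is (near) zero; wrong hypotheses leave large residuals.

def residual(I_t, I_t1, u):
    """Sum of squared brightness-constancy residuals for displacement u."""
    err = 0.0
    for x in range(len(I_t1)):
        if 0 <= x + u < len(I_t):
            err += (I_t1[x] - I_t[x + u]) ** 2
    return err

# Frame at t, and the same signal shifted so that I(x, t+1) = I(x + 2, t).
I_t  = [0, 0, 1, 3, 7, 3, 1, 0, 0, 0]
I_t1 = [1, 3, 7, 3, 1, 0, 0, 0, 0, 0]

errs = {u: residual(I_t, I_t1, u) for u in range(0, 4)}
best = min(errs, key=errs.get)
print(best)  # the correct displacement u = 2 minimizes the residual
```

In the tracker this role is played by the 2D image motion induced by the hypothesized 3D body motion, not a single integer shift.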
State of the Art: Cham and Rehg ‘99
I(x, t) = I(x + u, 0) + η
• Single camera, multiple hypotheses
• 2D templates (no drift but view dependent)
Multiple Hypotheses
• Posterior distribution over model parameters often multi-modal (due to ambiguities)
• Represent whole distribution:
– sampled representation
– each sample is a pose
– predict over time using a particle filtering approach
State of the Art: Deutscher, North, Bascle, & Blake ‘00
• Multiple hypotheses
• Multiple cameras
• Simplified clothing, lighting and background
State of the Art: Sidenbladh, Black, & Fleet ‘00
• Multiple hypotheses
• Monocular
• Brightness constancy
• Activity-specific prior
• With significant changes in view and depth, template-based methods will fail
How to Address the Problems
– Need a constraining likelihood model that is also invariant to variations in human appearance
– Need a good model of how people move
– Need an effective way to explore the model space (very high dimensional)
Bayesian formulation:
p(model | cue) ∝ p(cue | model) p(model)
Edge Detection?
• Probabilistic model?
• Under/over-segmentation, thresholds, …
Key Idea #1
• Use the 3D model to predict the location of limb boundaries in the scene.
• Compute various filter responses steered to the predicted orientation of the limb.
• Compute likelihood of filter responses using a statistical model learned from examples.
Key Idea #2
“Explain” the entire image.
p(image | foreground, background)
Generic, unknown background
Foreground person
Do not look in parts of the image considered background
Foreground part of image
Key Idea #2
p(foreground part of image | foreground) / p(foreground part of image | background)
Training Data
Points on limbs Points on background
Edge Distributions
Edge response steered to the predicted model edge orientation θ, from first-derivative filter responses f_x, f_y at scale σ:
e(x, σ, θ) = sin θ · f_x(x, σ) − cos θ · f_y(x, σ)
Similar to Konishi et al., CVPR 99
Ridge Distributions
Ridge response steered to the limb orientation θ, from second-derivative filter responses f_xx, f_yy, f_xy at scale σ:
r(x, σ, θ) = |sin²θ f_xx(x, σ) + cos²θ f_yy(x, σ) − 2 sinθ cosθ f_xy(x, σ)| − |cos²θ f_xx(x, σ) + sin²θ f_yy(x, σ) + 2 sinθ cosθ f_xy(x, σ)|
Ridge response only on certain image scales!
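A minimal sketch of steering these filter responses to a predicted orientation θ. The derivative values are plain numbers standing in for Gaussian-derivative filter outputs at scale σ, and the sign conventions are one reasonable reading of the formulas above, not necessarily the authors' exact convention:

```python
import math

# Steered edge and ridge responses from precomputed derivative responses.
# f_x, f_y are first derivatives; f_xx, f_yy, f_xy are second derivatives.

def steered_edge(fx, fy, theta):
    """Edge response steered to orientation theta."""
    return math.sin(theta) * fx - math.cos(theta) * fy

def steered_ridge(fxx, fyy, fxy, theta):
    """Ridge response steered to orientation theta: positive when the
    second-derivative energy for that orientation dominates the
    perpendicular one."""
    s, c = math.sin(theta), math.cos(theta)
    along = abs(s * s * fxx + c * c * fyy - 2 * s * c * fxy)
    across = abs(c * c * fxx + s * s * fyy + 2 * s * c * fxy)
    return along - across

# A pixel on a vertical bright limb: strong f_xx, negligible f_yy and f_xy.
# The steered ridge response is large at theta = pi/2 and negative for the
# perpendicular orientation.
print(steered_ridge(-4.0, 0.0, 0.0, math.pi / 2))  # ≈ 4.0
print(steered_ridge(-4.0, 0.0, 0.0, 0.0))          # ≈ -4.0
```

This is what lets the 3D model predict a limb orientation and then query the image for evidence at exactly that orientation.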
Motion Training Data
Motion response = I(x, t+1) - I(x+u, t)
Motion response = temporal brightness change given model of motion = noise term in brightness constancy assumption
Motion Distributions
Different underlying motion models
Likelihood Formulation
• Independence assumptions:
– Cues: p(image | model) = p(cue 1 | model) · p(cue 2 | model)
– Spatial: p(image | model) = Π_x p(image(x) | model)
– Scales: p(image | model) = Π_σ p(image(σ) | model), σ = 1, …
• Combines cues and scales!
• Simplification; in reality there are dependencies
Likelihood
p(image | fore, back) = Π_{x ∈ fore pixels} p(image(x) | fore) · Π_{x ∈ back pixels} p(image(x) | back)
= Π_{x ∈ all pixels} p(image(x) | back) · Π_{x ∈ fore pixels} p(image(x) | fore) / p(image(x) | back)
= const · Π_{x ∈ fore pixels} p(image(x) | fore) / p(image(x) | back)
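In practice this ratio is evaluated in the log domain: only pixels under the hypothesized foreground contribute, each adding log p(image(x) | fore) − log p(image(x) | back). The per-pixel probabilities below are made-up stand-ins for the learned filter-response distributions:

```python
import math

# Toy foreground/background log-likelihood ratio: a pose hypothesis is
# scored only by the pixels it claims as foreground, relative to how well
# the generic background model explains those same pixels.

def log_likelihood_ratio(fore_pixels, p_fore, p_back):
    """Sum of log p(image(x)|fore) - log p(image(x)|back) over foreground."""
    return sum(math.log(p_fore(v)) - math.log(p_back(v)) for v in fore_pixels)

# Made-up per-pixel models: the foreground model favors strong responses.
p_fore = lambda v: 0.8 if v > 0.5 else 0.2
p_back = lambda v: 0.3 if v > 0.5 else 0.7

good_hypothesis = [0.9, 0.8, 0.7]  # limb predicted over strong responses
bad_hypothesis = [0.1, 0.2, 0.1]   # limb predicted over background-like pixels

print(log_likelihood_ratio(good_hypothesis, p_fore, p_back) >
      log_likelihood_ratio(bad_hypothesis, p_fore, p_back))  # True
```

Because the background product over all pixels is a constant for every pose, hypotheses can be compared without ever segmenting the background.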
Step One Discussed…
– Need a constraining likelihood model that is also invariant to variations in human appearance
– Need a good model of how people move
Bayesian formulation:
p(model | cue) ∝ p(cue | model) p(model)
Models of Human Dynamics
• Models of dynamics are used to propagate the sampled distribution in time
• Constant velocity model
– All DOF in the model parameter space, φ, independent
– Angles are assumed to change with constant speed
– Speed and position changes are randomly sampled from a normal distribution
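The constant-velocity prior can be sketched as follows; the two-angle pose and the noise scales are illustrative assumptions, not the model's actual dimensions or learned values:

```python
import random

# Constant-velocity propagation of one pose sample: every DOF keeps its
# angular speed up to Gaussian noise, and each DOF is treated as
# independent of the others.

def propagate(angles, velocities, rng, sigma_vel=0.02, sigma_pos=0.01):
    """Propagate one pose sample by one time step."""
    new_vel = [v + rng.gauss(0.0, sigma_vel) for v in velocities]
    new_ang = [a + v + rng.gauss(0.0, sigma_pos)
               for a, v in zip(angles, new_vel)]
    return new_ang, new_vel

rng = random.Random(0)
angles, velocities = [0.0, 0.5], [0.1, -0.05]
for _ in range(3):
    angles, velocities = propagate(angles, velocities, rng)
print(len(angles), len(velocities))  # one angle and one speed per DOF
```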
Models of Human Dynamics
• Action-specific model: Walking
– Training data: 3D motion capture data
– From training set, learn mean cycle and common modes of deviation (PCA)
Mean cycle | Small noise | Large noise
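Once the mean cycle and PCA deviation modes are learned, a walking pose at a given cycle phase is the mean plus a weighted sum of modes. The two-joint mean cycle and single mode below are made-up toy curves, not learned motion-capture data:

```python
import math

# Action-specific walking prior (sketch): pose(phase) = mean cycle at that
# phase plus PCA deviation modes scaled by per-walker coefficients.

def walking_pose(phase, coeffs, mean_cycle, modes):
    """Pose (list of joint angles) at a given phase of the walk cycle."""
    pose = list(mean_cycle(phase))
    for c, mode in zip(coeffs, modes):
        pose = [p + c * m for p, m in zip(pose, mode(phase))]
    return pose

# Toy mean cycle (two joint angles swinging in antiphase) and one mode.
mean_cycle = lambda ph: [0.6 * math.sin(ph), 0.6 * math.sin(ph + math.pi)]
mode1 = lambda ph: [0.1 * math.cos(ph), 0.1 * math.cos(ph)]

neutral = walking_pose(0.0, [0.0], mean_cycle, [mode1])  # mean walker
styled = walking_pose(0.0, [1.5], mean_cycle, [mode1])   # deviates along mode 1
print(neutral, styled)
```

Sampling the coefficients (and the phase speed) with small or large noise gives the "Small noise" and "Large noise" variants shown on the slide.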
Step Two Also Discussed…
– Need a constraining likelihood model that is also invariant to variations in human appearance
– Need a good model of how people move
– Need an effective way to explore the model space (very high dimensional)
Bayesian formulation:
p(model | cue) ∝ p(cue | model) p(model)
Particle Filter
Posterior at t−1: p(φ_{t−1} | I_{t−1})
→ sample → propagate with temporal dynamics p(φ_t | φ_{t−1}) → weight by likelihood p(I_t | φ_t) → normalize →
Posterior at t: p(φ_t | I_t)
Problem: expensive representation of the posterior! Approaches to solve this:
• Lower the number of samples (Deutscher et al., CVPR00)
• Represent the space in other ways (Choo and Fleet, ICCV01)
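The sample-propagate-weight-normalize loop can be sketched in a few lines; the 1-D state, Gaussian dynamics, and Gaussian likelihood below are toy stand-ins for the high-dimensional pose and the learned image likelihood:

```python
import math
import random

# Minimal particle filter step: resample from the posterior at t-1,
# propagate through the temporal dynamics p(phi_t | phi_t-1), weight by
# the likelihood p(I_t | phi_t), and normalize the weights.

def particle_filter_step(particles, weights, dynamics, likelihood, rng):
    n = len(particles)
    resampled = rng.choices(particles, weights=weights, k=n)  # sample
    predicted = [dynamics(p, rng) for p in resampled]         # dynamics
    w = [likelihood(p) for p in predicted]                    # likelihood
    total = sum(w)
    return predicted, [wi / total for wi in w]                # normalize

rng = random.Random(1)
dynamics = lambda phi, r: phi + 1.0 + r.gauss(0.0, 0.2)  # state moves +1/step
likelihood = lambda phi: math.exp(-0.5 * (phi - observed) ** 2 / 0.5)

particles = [rng.gauss(0.0, 1.0) for _ in range(200)]
weights = [1.0 / 200] * 200
for observed in [1.0, 2.0, 3.0]:
    particles, weights = particle_filter_step(
        particles, weights, dynamics, likelihood, rng)

estimate = sum(p * w for p, w in zip(particles, weights))
print(round(estimate, 1))  # posterior mean tracks the observations, near 3
```

The sampled representation is exactly what makes multi-modal posteriors (and hence ambiguous poses) representable, at the cost noted above.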
Tracking an Arm
Moving camera, constant velocity model
1500 samples, ~2 min/frame
Self Occlusion
Constant velocity model
1500 samples, ~2 min/frame
Walking Person
Walking model
2500 samples, ~10 min/frame
Number of samples reduced from 15000 to 2500 by using the learned likelihood
Ongoing and Future Work
• Learned dynamics
• Correlation across scale
• Estimate background motion
• Statistical models of color and texture
• Automatic initialization
Lessons Learned
• Probabilistic (Bayesian) framework allows
– Integration of information in a principled way
– Modeling of priors
• Particle filtering allows
– Multi-modal distributions
– Tracking with ambiguities and non-linear models
• Learning image statistics and combining cues improves robustness and reduces computation
Conclusions
• Generic, learned model of appearance
– Combines multiple cues
– Exploits work on image statistics
• Use the 3D model to predict features
• Model of foreground and background
– Exploits the ratio between foreground and background likelihood
– Improves tracking
Other Related Work
J. Sullivan, A. Blake, M. Isard, and J. MacCormick. Object localization by Bayesian correlation. ICCV99.
J. Sullivan, A. Blake, and J. Rittscher. Statistical foreground modelling for object localisation. ECCV00.
J. Rittscher, J. Kato, S. Joga, and A. Blake. A probabilistic background model for tracking. ECCV00.
S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3), 1999.