Seeing Action 1, or Representation and Recognition of Activity by Machine. Aaron Bobick, afb@cc.gatech.edu, School of Interactive Computing, College of Computing, Georgia Tech


Page 1:

Seeing Action 1
or

Representation and Recognition of Activity by Machine

Aaron Bobick
afb@cc.gatech.edu

School of Interactive Computing College of Computing

Georgia Tech

Page 2:

An analogy… "Once upon a time"

"Once upon a time" … computer vision researchers wanted to do Object Recognition – "This is a chair"

But this was really hard…

Page 3:
Page 4:

An analogy… "Once upon a time"

"Once upon a time" … computer vision researchers wanted to do Object Recognition – "This is a chair"

But this was really hard…

So they gave up! (until recently) and instead did…

Page 5:
Page 6:

An analogy… "Once upon a time"

"Once upon a time" … computer vision researchers wanted to do Object Recognition – "This is a chair"

But this was really hard…

So they gave up! (sort of) and instead did…

Chair recognition became model-based object recognition, based upon geometric properties.

Page 7:

Object Recognition ≠ Activity Recognition

Imagine you created an algorithm that could recognize only me sitting down.
– Would you get a PhD? (well, maybe…)

"Activity recognition"* seems to get back to semantics:
– A person stealing a car
– A person sitting down
– Two people having a discussion (or a fight)
– 2 people attacking a third
– A crowd is storming the Bastille

*which we'll define more precisely later.

Page 8:

Recognition implies Representation

To recognize something you have to have something to recognize.
– Brilliant, huh?

Just like any other AI recognition problem, we must have a representation of whatever it is we are going to recognize.

And with representations come questions…

Page 9:

Some Q's about Representations

Marr's criteria:
– Scope & range, sensitivity
"Language" of the representation
Computability of an instance
Learnability of the "class"; training versus learning versus "the oracle"
Stability in face of perceptual uncertainty
Inference-support (reasoning in face of variation or ambiguity)
Others???

Page 10:

BUT WAIT!!!!

Just what is it we are trying to represent?

Deconstruct delivering a package… (videos)

Page 11:

Barnes and Noble Video Segment

Page 12:

What are the "activities" or "actions" taking place?

Back door open
Truck arriving
Back door closing
Truck leaving
Carrying a package
Following the car
Unloading

Page 13:

An old story (’96):

Behavior taxonomy for vision

Different levels of understanding motion – visual evidence of behavior – require different forms of representation, methods of manipulating time, and depth of reasoning.

Propose three levels:
Movement – atomic behaviors defined by motion
• Ballet moves, body motions ("sitting")
Activity – sequences or compositions
• Statistically structured events
Action – semantics, causality, knowledge
• Cooking, football plays, "moving Coke cans"

Page 14:

A better story… (anonymous paper)

(Still) three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
• "Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
• "Opening a door", "Lifting a package"
• Typically short in time
• Might be definable in terms of motion, especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
• "Delivering packages"
• Typically has causal underpinnings
• Can be thought of as statistically structured events

Page 15:

Thinking a bit about the levels…

(Still) three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
• "Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
• "Opening a door", "Lifting a package"
• Typically short in time
• Might be definable in terms of motion, especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
• "Delivering packages"
• Typically has causal underpinnings
• Can be thought of as statistically structured events

Maybe Actions are movements in context??

Page 16:

What is the goal of a representation of activity/behaviors?

Recognition implies representation

Representations can talk about what events *ARE*:
– Definitional – but sometimes not "real" because primitives are not grounded
– Permits specification of reasoning mechanism
– Context can be made explicit (but is not usually)
– Hard to learn

Representations can talk about what events *LOOK LIKE*:
– Sometimes learnable, always well-defined primitives
– Typically not guaranteed to be complete
– Have no explanatory power
– Often leverages (i.e., is wholly dependent upon) context – makes it learnable from specific data

Page 17:

Data-driven vs Knowledge-taught

[Figure: representations arranged on two axes – data-driven (statistical) to knowledge-taught (structural) horizontally, and movement → action → activity vertically, with temporal and relational complexity increasing. Examples plotted include MHIs, PHMMs, SCFGs, P-Nets, BNs, PNF, event n-grams, and suffix trees.]

Page 18:

So how do we proceed?

Three (now only 2.5) sessions, climbing the representational ladder in terms of semantics and representational "power".

Cover movements through activity-level descriptions.

Some structural, some statistical.

I will leave the real AI to others here…

Page 19:

Strict Appearance: human movements

Is recognizing movement a 3D or 2D problem? Simple human psychophysics and computational complexity argue for 2D aspects.

Temporal templates: Movements are recognized directly from the motion.

Appearance-based recognition can assist geometric recovery: recognition labels the parts and allows extraction.

demonstration

Page 20:

Blurry Video

Page 21:

Less Blurry Video!

Page 22:

Shape and motion: view-based

Schematic representation of sitting at 90°

Page 23:

Motion energy images

Spatial accumulation of motion.
Collapse over a specific time window.
Motion measurement method not critical (e.g. motion differencing).

Page 24:

Motion history images

Motion history images are a different function of temporal volume.

Pixel operator is replacement decay:

    I(x,y,t) = τ                          if moving
    I(x,y,t) = max(I(x,y,t-1) − 1, 0)     otherwise

Trivial to construct Ik(x,y,t) from I(x,y,t) so can process multiple time window lengths without more search.

MEI is thresholded MHI

[Figure: MHI pixels that moved at t−1 appear brighter than pixels that moved at t−15.]
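The replacement-decay update is simple to sketch; below is a minimal numpy version (function names and the toy example are mine, not from the paper), using a binary motion mask per frame:

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """Replacement decay: pixels that moved are set to tau;
    all other pixels decay by 1, floored at 0."""
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

def mei_from_mhi(mhi):
    """The motion energy image is just the thresholded MHI."""
    return (mhi > 0).astype(np.uint8)

# Toy example: a one-pixel "object" sweeps right across a 1x5 frame.
tau = 3
mhi = np.zeros((1, 5))
for t in range(4):
    mask = np.zeros((1, 5), dtype=bool)
    mask[0, t] = True                 # the object occupies column t at time t
    mhi = update_mhi(mhi, mask, tau)

print(mhi)                # [[0. 1. 2. 3. 0.]] - a recency-coded motion trail
print(mei_from_mhi(mhi))  # [[0 1 1 1 0]]     - where any motion occurred
```

The brighter a pixel, the more recently it moved, which is exactly the recency coding the MHI figures illustrate.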

Page 25:

Temporal-templates

MEI + MHI = Temporal template

[Figure: motion energy image and motion history image shown side by side.]

Page 26:

Recognizing temporal templates

For MEI and MHI compute global properties (e.g. Hu moments). Treat both as grayscale images.
Collect statistics on the distribution of those properties over people for each movement.
At run time, construct MEIs and MHIs backwards in time.
– Recognizes movements as soon as they complete.
Linear time scaling.
– Compute the range of τ using the min and max of the training data.
Simple recursive formulation, so very fast.
Filter implementation obvious, so biologically "relevant".
Best reference is PAMI 2001, Bobick and Davis.
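The paper uses all seven Hu moments; as an illustration only (not the paper's code), here is a numpy sketch of the first two Hu invariants, which are unchanged under translation and scaling of the image:

```python
import numpy as np

def _eta(img, p, q):
    """Scale-normalized central moment eta_pq of a grayscale image."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    mu_pq = ((x - xc) ** p * (y - yc) ** q * img).sum()
    return mu_pq / m00 ** (1 + (p + q) / 2)

def hu_first_two(img):
    """First two Hu invariants of an image (e.g. an MEI or MHI)."""
    phi1 = _eta(img, 2, 0) + _eta(img, 0, 2)
    phi2 = (_eta(img, 2, 0) - _eta(img, 0, 2)) ** 2 + 4 * _eta(img, 1, 1) ** 2
    return phi1, phi2

# The same shape at two scales yields nearly identical invariants.
small = np.zeros((20, 20)); small[5:10, 5:10] = 1.0     # 5x5 square
big   = np.zeros((40, 40)); big[10:20, 10:20] = 1.0     # 10x10 square
print(hu_first_two(small))   # phi1 = 0.16
print(hu_first_two(big))     # phi1 = 0.165, close up to discretization
```

Recognition then reduces to comparing such feature vectors against the per-movement statistics gathered in training.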

Page 27:

Aerobics examples

Page 28:

Virtual PAT (Personal Aerobics Trainer)

Uses MHI recognition.
Portable IR background subtraction system. (CAPTECH '98)

Page 29:

The KidsRoom

A narrative, interactive children’s playspace.

Demonstrates computer vision “action” recognition.

Sometimes possible because the machine knows the context.

A kinder, gentler C3I interface

Ported to the Millennium Dome, London, 2001

Summary and critique in Presence, August 1999.

Page 30:

Recognizing Movement in the KidsRoom

First teach the kids, then observe.

Temporal templates "plus" (details in the paper).

Monsters always do something, but only speak it when sure.

Page 31:

Some Q's about Representations…

Scope and Range:
– Gross motor activities; view dependent
"Language" of the representation:
– Statistical characterization of video properties
Computability of an instance:
– Easy to compute, assuming you can extract the person from the background
Learnability of the "class":
– Parameters predetermined by design
– Explicit training; easily acquired
Stability in face of perceptual uncertainty:
– Pretty good
Inference-support:
– Zilch

Page 32:

Lesson from Temporal Templates:

It’s the representation, stupid…

Page 33:

"Gesture recognition"-like activities

Page 34:

Some thoughts about gesture

There is a conference on Face and Gesture Recognition, so obviously gesture recognition is an important problem…

Prototype scenario:
– Subject does several examples of "each gesture"
– System "learns" (or is trained) to have some sort of model for each
– At run time, compare input to known models and pick one

Recently some work at Univ. of Maryland on "ballistic motions" – decomposing a sequence into its parts.

Page 35:

Anatomy of Hidden Markov Models

Typically thought of as a stochastic FSM where:
– a_ij is P(q_t = j | q_t-1 = i)
– b_j(x) is p(x_t = x | q_t = j)

HMMs model activity by presuming activity is a first-order Markov process. The sequence is output from the b_j(x). States are hidden and unknown.

Train via expectation-maximization (EM). (more on this…)

Paradigm:
– Training: examples from each class; slow but OK.
– Testing: fast (Viterbi); typical PR types of issues.
– Backward looking; real-time at completion.
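For concreteness, here is a minimal, illustrative sketch (mine, not a tutorial-quality implementation) of the scaled forward algorithm that underlies this paradigm: score an observation sequence under each gesture's HMM and pick the winner.

```python
import numpy as np

def forward_loglik(A, B, pi, obs):
    """Log-likelihood of a discrete observation sequence under an HMM.

    A[i, j] = P(q_t = j | q_{t-1} = i)   (transitions)
    B[j, k] = p(x_t = k | q_t = j)       (emissions)
    pi[i]   = P(q_1 = i)                 (initial distribution)
    """
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(x_1)
    c = alpha.sum()
    log_p = np.log(c)
    alpha = alpha / c                  # rescale to avoid underflow
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # forward recursion
        c = alpha.sum()
        log_p += np.log(c)
        alpha = alpha / c
    return log_p

# Two-state toy model: each state strongly prefers emitting its own symbol.
A  = np.array([[0.9, 0.1],
               [0.1, 0.9]])
B  = np.array([[0.8, 0.2],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

# A "sticky" sequence scores far higher than a rapidly alternating one.
print(forward_loglik(A, B, pi, [0, 0, 0, 1, 1, 1]))
print(forward_loglik(A, B, pi, [0, 1, 0, 1, 0, 1]))
```

In the gesture setting, the observation symbols would be quantized features of the tracked hand or body, and each gesture class gets its own (A, B, pi) trained by EM.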

Page 36:

Tutorial on HMM?

Yes/No???

Page 37:

Wins and Losses of HMMs in Gesture

Good points about HMMs:
– A learning paradigm that acquires spatial and temporal models and does some amount of feature selection.
– Recognition is fast; training is not so fast but not too bad.

Not so good points:
– If you know something about state definitions, it is difficult to incorporate (coming later…)
– Every gesture is a new class, independent of anything else you've learned.
– Particularly bad for "parameterized gesture."

Page 38:

Parameterized Gesture

“I caught a fish this big.”

Page 39:

Parametric HMMs (PAMI, 1999)

Basic ideas:
– Make the output probabilities of the state be a function of the parameter of interest: b_j(x) becomes b'_j(x, θ).
– Maintain the same temporal properties; a_ij unchanged.
– Train with known parameter values to solve for the dependence of b' on θ.
– During testing, use EM to find the θ that gives the highest probability. That probability is the confidence in recognition; the best θ is the recovered parameter.

Issues:
– How to represent dependence on θ?
– How to train given θ?
– How to test for θ?
– What are the limitations on dependence on θ?

Page 40:

Linear PHMM - Representation

Represent dependence on θ as linear movement of the mean of the Gaussians of the states:

    μ̂_j(θ) = W_j θ + μ̄_j

Need to learn W_j and μ̄_j for each state j. (ICCV '98)

(For the graphical model folks in the audience.)
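A sketch of the core idea (the symbols W_j and μ̄_j are from the slide; the code, names, and toy numbers are mine): each state's Gaussian mean slides linearly with θ, so with a fixed state assignment, recovering θ from an observation is a least-squares problem.

```python
import numpy as np

def state_mean(theta, W, mu_bar):
    """Mean of a state's Gaussian as a linear function of theta:
    mu_j(theta) = W_j @ theta + mu_bar_j."""
    return W @ theta + mu_bar

def state_loglik(x, theta, W, mu_bar, sigma2=1.0):
    """Log N(x; mu_j(theta), sigma2 * I): the PHMM output density."""
    d = x - state_mean(theta, W, mu_bar)
    k = x.shape[0]
    return -0.5 * (d @ d / sigma2 + k * np.log(2 * np.pi * sigma2))

# Toy: one state, 2-D observations, scalar theta.
W      = np.array([[1.0], [2.0]])
mu_bar = np.array([0.5, -0.5])
x      = state_mean(np.array([3.0]), W, mu_bar)   # noiseless observation

# With the state assignment fixed, the best theta is a least-squares fit:
theta_hat, *_ = np.linalg.lstsq(W, x - mu_bar, rcond=None)
print(theta_hat)   # recovers [3.0]
```

In the full PHMM, the state assignments are unknown, so training and testing interleave this kind of linear solve with EM's posterior computation, as on the following slides.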

Page 41:

Linear PHMM - training

Need to derive the EM equations for the linear parameters and proceed as normal.

Page 42:

Linear PHMM - testing

Derive EM equations with respect to θ:

We are testing by EM! (i.e. iterative):
– Solve for γ_tk given a guess for θ
– Solve for θ given a guess for γ_tk

Page 43:

How big was the fish?

Page 44:

Pointing

Pointing is the prototypical example of a parameterized gesture.
Assuming two DOF, can parameterize either by (x, y) or by angles.
Under the linear assumption one must choose carefully.
A generalized non-linear map would allow greater freedom. (ICCV '99)

Page 45:

Linear pointing results

Test for both recognition and recovery:

If we prune based on legal θ (MAP via a uniform density):

Page 46:

Noise sensitivity

Compare ad hoc procedure with PHMM parameter recovery (ignoring “their” recognition problem!!).

Page 47:

Lesson from PHMMs:

It’s the representation, stupid…

(The non-linear case is an even better representation.)

Page 48:

Some Q's about Representations…

Scope and Range:
– Densely sampled motions through parameter spaces
"Language" of the representation:
– Stochastic FSM through regions of parameter space
Computability of an instance:
– Only assumes a consistent noise model between training and testing
Learnability of the "class":
– Explicit training; easily acquired
– All parameters learned
Stability in face of perceptual uncertainty:
– Pretty good (see above)
Inference-support:
– Zilch