Seeing Action 1
or
Representation and Recognition of Activity by Machine
Aaron [email protected]
School of Interactive Computing College of Computing
Georgia Tech
An analogy… "Once upon a time"
"Once upon a time" … computer vision researchers wanted to do Object Recognition– "This is a chair"
But this was really hard…
So they gave up! (sort of) and instead did…
Chair recognition became model-based object recognition, based upon geometric properties.
Object Recognition ≠ Activity Recognition
Imagine you created an algorithm that could recognize only me sitting down.
– Would you get a PhD? (well, maybe…)
"Activity recognition"* seems to get back to semantics:
– A person stealing a car
– A person sitting down
– Two people having a discussion (or a fight)
– Two people attacking a third
– A crowd is storming the Bastille
*which we'll define more about later.
Recognition implies Representation
To recognize something you have to have something to recognize.
– Brilliant, huh?
Just like any other AI recognition problem, we must have a representation of whatever it is we are going to recognize.
And with representations come questions…
Some Q's about Representations
Marr's criteria:
– Scope & range, sensitivity
– "Language" of the representation
– Computability of an instance
– Learnability of the "class"; training versus learning versus "the oracle"
– Stability in face of perceptual uncertainty
– Inference-support (reasoning in face of variation or ambiguity)
– Others???
BUT WAIT!!!!
Just what is it we are trying to represent?
Deconstruct delivering a package… (videos)
Barnes and Noble Video Segment
What are the "activities" or "actions" taking place?
– Back door open
– Truck arriving
– Back door closing
– Truck leaving
– Carrying a package
– Following the car
– Unloading
An old story (’96):
Behavior taxonomy for vision
Different levels of understanding motion – visual evidence of behavior – require different forms of representation, methods of manipulating time, and depth of reasoning.
Propose three levels:
Movement – atomic behaviors defined by motion
• Ballet moves, body motions ("sitting")
Activity – sequences or compositions
• Statistically structured events
Action – semantics, causality, knowledge
• Cooking, football plays, "moving Coke cans"
A better story… (anonymous paper)
(Still) three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
– "Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
– "Opening a door", "Lifting a package"
– Typically short in time
– Might be definable in terms of motion; especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
– "Delivering packages"
– Typically has causal underpinnings
– Can be thought of as statistically structured events
Thinking a bit about the levels…
Maybe Actions are movements in context??
Context
What is the goal of a representation of activity/behaviors?
Recognition implies representation
Representations can talk about what events *ARE*:
– Definitional – but sometimes not "real" because the primitives are not grounded
– Permits specification of a reasoning mechanism
– Context can be made explicit (but usually is not)
– Hard to learn
Representations can talk about what events *LOOK LIKE*:
– Sometimes learnable; always well-defined primitives
– Typically not guaranteed to be complete
– Have no explanatory power
– Often leverages (i.e. is wholly dependent upon) context – makes it learnable from specific data
Data-driven vs Knowledge-taught
[Chart: representations arranged along a data-driven ↔ knowledge-taught (statistical ↔ structural) axis and by level – Movement, Action, Activity – with temporal and relational complexity increasing with level. Representations plotted include MHI's, PHMM's, SCFG's, P-Net's, BN's, PNF, Event N-grams, and Suffix Trees.]
So how do we proceed?
Three (now only 2.5) sessions, climbing the representational ladder in terms of the semantics and representational "power".
Cover movements through activity-level descriptions.
Some structural, some statistical.
I will leave the real AI to others here…
Strict Appearance: human movements
Is recognizing movement a 3D or 2D problem? Simple human psychophysics and computational complexity argue for 2D aspects.
Temporal templates: Movements are recognized directly from the motion.
Appearance-based recognition can assist geometric recovery: recognition labels the parts and allows extraction.
demonstration
Blurry Video
Less Blurry Video!
Shape and motion: view-based
Schematic representation of sitting, viewed at 90°
Motion energy images
Spatial accumulation of motion. Collapse over a specific time window. Motion measurement method not critical (e.g. motion differencing).
Motion history images
Motion history images are a different function of temporal volume.
Pixel operator is replacement decay:
– if moving: I(x,y,t) = τ
– otherwise: I(x,y,t) = max(I(x,y,t−1) − 1, 0)
Trivial to construct Ik(x,y,t) from I(x,y,t) so can process multiple time window lengths without more search.
MEI is thresholded MHI
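The replacement-decay rule above is only a few lines of array code. Here is a minimal sketch in Python/numpy (the function names are mine, not from the paper):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One replacement-decay step: moving pixels are set to tau,
    all other pixels decay by one, floored at zero."""
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion_mask, tau, decayed)

def mei_from_mhi(mhi):
    """The MEI is simply the thresholded (binarized) MHI."""
    return mhi > 0
```

Feeding in a binary motion mask per frame (e.g. from simple image differencing) yields the MHI; thresholding it at zero gives the MEI, so both templates come from a single pass over the video.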
Temporal-templates
MEI + MHI = Temporal template (motion energy image + motion history image)
Recognizing temporal templates
For the MEI and MHI compute global properties (e.g. Hu moments). Treat both as grayscale images.
Collect statistics on the distribution of those properties over people for each movement.
At run time, construct MEIs and MHIs backwards in time.
– Recognize movements as soon as they complete.
Linear time scaling:
– Compute the range of τ using the min and max of the training data.
Simple recursive formulation, so very fast. Filter implementation obvious, so biologically "relevant". Best reference is PAMI 2001, Bobick and Davis.
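The "global properties" step can be sketched as a small numpy computation of the seven Hu invariant moments on a grayscale template. This is the standard textbook formulation, not the paper's exact code:

```python
import numpy as np

def hu_moments(img):
    """Seven Hu invariant moments of a 2-D grayscale template:
    invariant to translation, scale, and rotation."""
    img = np.asarray(img, dtype=float)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    def mu(p, q):                      # central moments
        return ((x - xc) ** p * (y - yc) ** q * img).sum()
    def eta(p, q):                     # scale-normalized moments
        return mu(p, q) / m00 ** (1 + (p + q) / 2)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    h5 = ((n30 - 3 * n12) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    h6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03))
    h7 = ((3 * n21 - n03) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])
```

Each movement class is then summarized by the statistics (e.g. mean and covariance) of these seven numbers over training examples, and a test template is matched by something like Mahalanobis distance.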
Aerobics examples
Virtual PAT (Personal Aerobics Trainer)
– Uses MHI recognition
– Portable IR background subtraction system (CAPTECH '98)
The KidsRoom
A narrative, interactive children’s playspace.
Demonstrates computer vision “action” recognition.
Sometimes possible because the machine knows the context.
A kinder, gentler C3I interface
Ported to the Millennium Dome, London, 2001.
Summary and critique in Presence, August 1999.
Recognizing Movement in the KidsRoom
First teach the kids, then observe.
Temporal templates "plus" (details in the paper).
Monsters always do something, but only speak it when sure.
Some Q's about Representations…
Scope and Range:
– Gross motor activities; view dependent
"Language" of the representation:
– Statistical characterization of video properties
Computability of an instance:
– Easy to compute, assuming you can extract the person from the background
Learnability of the "class":
– Parameters predetermined by design
– Explicit training; easily acquired
Stability in face of perceptual uncertainty:
– Pretty good
Inference-support:
– Zilch
Lesson from Temporal Templates:
It’s the representation, stupid…
"Gesture recognition"-like activities
Some thoughts about gesture
There is a conference on Face and Gesture Recognition, so obviously gesture recognition is an important problem…
Prototype scenario:
– Subject does several examples of "each gesture"
– System "learns" (or is trained) to have some sort of model for each
– At run time, compare the input to the known models and pick one
Recently, some work at the Univ. of Maryland on "ballistic motions": decomposing a sequence into its parts.
Anatomy of Hidden Markov Models
Typically thought of as stochastic FSMs where:
– a_ij is P(q_t = j | q_t−1 = i)
– b_j(x) is p(x_t = x | q_t = j)
HMMs model activity by presuming activity is a first-order Markov process. The sequence is output from the b_j(x). The states are hidden and unknown.
Train via expectation-maximization (EM) (more on this…). Paradigm:
– Training: examples from each class; slow but OK.
– Testing: fast (Viterbi); typical PR types of issues.
– Backward looking; real-time at completion.
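The fast testing step can be illustrated with a log-space Viterbi decoder for a discrete-output HMM. This is the standard textbook routine; the variable names are mine:

```python
import numpy as np

def viterbi(a, b, pi, obs):
    """Most likely hidden-state sequence for a discrete-output HMM.
    a[i, j] = P(q_t = j | q_t-1 = i); b[j, k] = P(x_t = k | q_t = j)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))        # best log-prob ending in each state
    psi = np.zeros((T, N), int)     # back-pointers to best predecessor
    with np.errstate(divide="ignore"):
        la, lb, lpi = np.log(a), np.log(b), np.log(pi)
    delta[0] = lpi + lb[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + la      # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + lb[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrack from the end
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

For recognition, one such model is trained per gesture class; at test time the input is scored against every model and the highest-likelihood class wins.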
Tutorial on HMM?
Yes/No???
Wins and Losses of HMMs in Gesture
Good points about HMMs:
– A learning paradigm that acquires spatial and temporal models and does some amount of feature selection.
– Recognition is fast; training is not so fast but not too bad.
Not-so-good points:
– If you know something about state definitions, it is difficult to incorporate (coming later…)
– Every gesture is a new class, independent of anything else you've learned.
– → Particularly bad for "parameterized gesture."
Parameterized Gesture
“I caught a fish this big.”
Parametric HMMs (PAMI, 1999)
Basic ideas:
– Make the output probabilities of the state be a function of the parameter of interest: b_j(x) becomes b'_j(x, θ).
– Maintain the same temporal properties; a_ij unchanged.
– Train with known parameter values to solve for the dependencies of b' on θ.
– During testing, use EM to find the θ that gives the highest probability. That probability is the confidence in recognition; the best θ is the recovered parameter.
Issues:
– How to represent dependence on θ?
– How to train given θ?
– How to test for θ?
– What are the limitations on dependence on θ?
Linear PHMM - Representation
Represent dependence on θ as linear movement of the mean of the Gaussians of the states:
μ̂_j(θ) = W_j θ + μ̄_j
Need to learn W_j and μ̄_j for each state j. (ICCV '98)
(For the graphical model folks in the audience.)
Linear PHMM - training
Need to derive the EM update equations for the linear parameters W_j and μ̄_j, then proceed as normal.
Linear PHMM - testing
Derive the EM equations with respect to θ:
We are testing by EM! (i.e. iterative):
– Solve for the state occupancies γ_tk given a guess for θ
– Solve for θ given a guess for γ_tk
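The alternation can be illustrated with a deliberately stripped-down toy: scalar θ, 1-D Gaussian states, and no transition structure, so the "E-step" is just a per-observation state responsibility and the "M-step" a closed-form weighted least-squares solve for θ. This is a simplification for intuition, not the PHMM equations themselves:

```python
import numpy as np

def estimate_theta(x, W, mu_bar, sigma=1.0, n_iter=20):
    """Toy PHMM-style testing: alternate between state responsibilities
    (gamma) and a closed-form solve for theta.
    State means are mu_j(theta) = W[j] * theta + mu_bar[j]."""
    theta = 0.0
    for _ in range(n_iter):
        # "E-step": responsibility of each state for each observation,
        # under Gaussian outputs with the current theta.
        means = W[None, :] * theta + mu_bar[None, :]      # shape (T, J)
        log_lik = -0.5 * ((x[:, None] - means) / sigma) ** 2
        gamma = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # "M-step": weighted least squares for theta.
        num = (gamma * W[None, :] * (x[:, None] - mu_bar[None, :])).sum()
        den = (gamma * W[None, :] ** 2).sum()
        theta = num / den
    return theta
```

Run on data generated with a known θ, the loop converges to it in a few iterations; the final likelihood plays the role of the recognition confidence.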
How big was the fish?
Pointing
Pointing is the prototypical example of a parameterized gesture.
– Assuming two DOF, one can parameterize either by (x, y) or by a pair of angles.
– Under the linear assumption, the parameterization must be chosen carefully.
– A generalized non-linear map would allow greater freedom. (ICCV '99)
Linear pointing results
Test for both recognition and recovery:
If we prune based on legal values of θ (MAP via a uniform density):
Noise sensitivity
Compare ad hoc procedure with PHMM parameter recovery (ignoring “their” recognition problem!!).
Lesson from PHMMs:
It’s the representation, stupid…
(The non-linear case is an even better representation.)
Some Q's about Representations…
Scope and Range:
– Densely sampled motions through parameter spaces
"Language" of the representation:
– Stochastic FSM through regions of parameter space
Computability of an instance:
– Only assumes a consistent noise model between training and testing
Learnability of the "class":
– Explicit training; easily acquired
– All parameters learned
Stability in face of perceptual uncertainty:
– Pretty good (see above)
Inference-support:
– Zilch