Seeing Action 1
or
Representation and Recognition of Activity by Machine
Aaron [email protected]
School of Interactive Computing College of Computing
Georgia Tech
An analogy… "Once upon a time"
"Once upon a time" … computer vision researchers wanted to do Object Recognition– "This is a chair"
But this was really hard…
So they gave up! (sort of) and instead did…
Chair recognition became model-based object recognition, based upon geometric properties.
Object Recognition ≠ Activity Recognition
Imagine you created an algorithm that could recognize only me sitting down.
– Would you get a PhD? (well, maybe…)
"Activity recognition"* seems to get back to semantics:
– A person stealing a car
– A person sitting down
– Two people having a discussion (or a fight)
– Two people attacking a third
– A crowd is storming the Bastille
*which we'll define more about later.
Recognition implies Representation
To recognize something you have to have something to recognize.
– Brilliant, huh?
Just like any other AI recognition problem, we must have a representation of whatever it is we are going to recognize.
And with representations come questions…
Some Q's about Representations
Marr's criteria:
– Scope & range, sensitivity
– "Language" of the representation
– Computability of an instance
– Learnability of the "class"; training versus learning versus "the oracle"
– Stability in face of perceptual uncertainty
– Inference-support (reasoning in face of variation or ambiguity)
– Others???
BUT WAIT!!!!
Just what is it we are trying to represent?
Deconstruct delivering a package… (videos)
Barnes and Noble Video Segment
What are the "activities" or "actions" taking place?
– Back door open
– Truck arriving
– Back door closing
– Truck leaving
– Carrying a package
– Following the car
– Unloading
An old story (’96):
Behavior taxonomy for vision
Different levels of understanding motion – visual evidence of behavior – require different forms of representation, methods of manipulating time, and depth of reasoning.
Propose three levels:
Movement – atomic behaviors defined by motion
• Ballet moves, body motions ("sitting")
Activity – sequences or compositions
• Statistically structured events
Action – semantics, causality, knowledge
• Cooking, football plays, "moving Coke cans"
A better story… (anonymous paper)
(Still) three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
– "Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
– "Opening a door", "Lifting a package"
– Typically short in time
– Might be definable in terms of motion; especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
– "Delivering packages"
– Typically has causal underpinnings
– Can be thought of as statistically structured events
Thinking a bit about the levels…
Maybe Actions are movements in context??
Context
What is the goal of a representation of activity/behaviors?
Recognition implies representation
Representations can talk about what events *ARE*:
– Definitional – but sometimes not "real" because the primitives are not grounded
– Permits specification of a reasoning mechanism
– Context can be made explicit (but usually is not)
– Hard to learn
Representations can talk about what events *LOOK LIKE*:
– Sometimes learnable; always well-defined primitives
– Typically not guaranteed to be complete
– Have no explanatory power
– Often leverages (i.e. is wholly dependent upon) context – makes it learnable from specific data
Data-driven vs Knowledge-taught
[Chart: representations arranged along a data-driven ↔ knowledge-taught (statistical ↔ structural) axis and by level – Movement, Action, Activity – with temporal and relational complexity increasing with level. Representations plotted include MHI's, PHMM's, SCFG's, P-Net's, BN's, PNF, Event N-grams, and Suffix Trees.]
So how do we proceed?
Three (now only 2.5) sessions, climbing the representational ladder in terms of the semantics and representational "power".
Cover movements through activity-level descriptions.
Some structural, some statistical.
I will leave the real AI to others here…
Strict Appearance: human movements
Is recognizing movement a 3D or 2D problem? Simple human psychophysics and computational complexity argue for 2D aspects.
Temporal templates: Movements are recognized directly from the motion.
Appearance-based recognition can assist geometric recovery: recognition labels the parts and allows extraction.
demonstration
Blurry Video
Less Blurry Video!
Shape and motion: view-based
Schematic representation of sitting, viewed at 90°
Motion energy images
Spatial accumulation of motion. Collapse over a specific time window. Motion measurement method not critical (e.g. motion differencing).
Motion history images
Motion history images are a different function of temporal volume.
Pixel operator is replacement decay:
– if moving: I(x,y,t) = τ
– otherwise: I(x,y,t) = max(I(x,y,t−1) − 1, 0)
Trivial to construct Ik(x,y,t) from I(x,y,t) so can process multiple time window lengths without more search.
MEI is thresholded MHI
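The replacement-decay rule above is only a few lines of array code. Here is a minimal sketch in Python/numpy (the function names are mine, not from the paper):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One replacement-decay step: moving pixels are set to tau,
    all other pixels decay by one, floored at zero."""
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion_mask, tau, decayed)

def mei_from_mhi(mhi):
    """The MEI is simply the thresholded (binarized) MHI."""
    return mhi > 0
```

Feeding in a binary motion mask per frame (e.g. from simple image differencing) yields the MHI; thresholding it at zero gives the MEI, so both templates come from a single pass over the video.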
Temporal-templates
MEI + MHI = Temporal template (motion energy image + motion history image)
Recognizing temporal templates
For the MEI and MHI compute global properties (e.g. Hu moments). Treat both as grayscale images.
Collect statistics on the distribution of those properties over people for each movement.
At run time, construct MEIs and MHIs backwards in time.
– Recognize movements as soon as they complete.
Linear time scaling:
– Compute the range of τ using the min and max of the training data.
Simple recursive formulation, so very fast. Filter implementation obvious, so biologically "relevant". Best reference is PAMI 2001, Bobick and Davis.
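The "global properties" step can be sketched as a small numpy computation of the seven Hu invariant moments on a grayscale template. This is the standard textbook formulation, not the paper's exact code:

```python
import numpy as np

def hu_moments(img):
    """Seven Hu invariant moments of a 2-D grayscale template:
    invariant to translation, scale, and rotation."""
    img = np.asarray(img, dtype=float)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    def mu(p, q):                      # central moments
        return ((x - xc) ** p * (y - yc) ** q * img).sum()
    def eta(p, q):                     # scale-normalized moments
        return mu(p, q) / m00 ** (1 + (p + q) / 2)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    h5 = ((n30 - 3 * n12) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    h6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03))
    h7 = ((3 * n21 - n03) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])
```

Each movement class is then summarized by the statistics (e.g. mean and covariance) of these seven numbers over training examples, and a test template is matched by something like Mahalanobis distance.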
Aerobics examples
Virtual PAT (Personal Aerobics Trainer)
– Uses MHI recognition
– Portable IR background subtraction system (CAPTECH '98)
The KidsRoom
A narrative, interactive children’s playspace.
Demonstrates computer vision “action” recognition.
Sometimes possible because the machine knows the context.
A kinder, gentler C3I interface
Ported to the Millennium Dome, London, 2001.
Summary and critique in Presence, August 1999.
Recognizing Movement in the KidsRoom
First teach the kids, then observe.
Temporal templates "plus" (details in the paper).
Monsters always do something, but only speak it when sure.
Some Q's about Representations…
Scope and Range:
– Gross motor activities; view dependent
"Language" of the representation:
– Statistical characterization of video properties
Computability of an instance:
– Easy to compute, assuming you can extract the person from the background
Learnability of the "class":
– Parameters predetermined by design
– Explicit training; easily acquired
Stability in face of perceptual uncertainty:
– Pretty good
Inference-support:
– Zilch
Lesson from Temporal Templates:
It’s the representation, stupid…
"Gesture recognition"-like activities
Some thoughts about gesture
There is a conference on Face and Gesture Recognition, so obviously gesture recognition is an important problem…
Prototype scenario:
– Subject does several examples of "each gesture"
– System "learns" (or is trained) to have some sort of model for each
– At run time, compare the input to the known models and pick one
Recently, some work at the Univ. of Maryland on "ballistic motions": decomposing a sequence into its parts.
Anatomy of Hidden Markov Models
Typically thought of as stochastic FSMs where:
– a_ij is P(q_t = j | q_t−1 = i)
– b_j(x) is p(x_t = x | q_t = j)
HMMs model activity by presuming activity is a first-order Markov process. The sequence is output from the b_j(x). The states are hidden and unknown.
Train via expectation-maximization (EM) (more on this…). Paradigm:
– Training: examples from each class; slow but OK.
– Testing: fast (Viterbi); typical PR types of issues.
– Backward looking; real-time at completion.
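The fast testing step can be illustrated with a log-space Viterbi decoder for a discrete-output HMM. This is the standard textbook routine; the variable names are mine:

```python
import numpy as np

def viterbi(a, b, pi, obs):
    """Most likely hidden-state sequence for a discrete-output HMM.
    a[i, j] = P(q_t = j | q_t-1 = i); b[j, k] = P(x_t = k | q_t = j)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))        # best log-prob ending in each state
    psi = np.zeros((T, N), int)     # back-pointers to best predecessor
    with np.errstate(divide="ignore"):
        la, lb, lpi = np.log(a), np.log(b), np.log(pi)
    delta[0] = lpi + lb[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + la      # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + lb[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrack from the end
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

For recognition, one such model is trained per gesture class; at test time the input is scored against every model and the highest-likelihood class wins.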
Tutorial on HMM?
Yes/No???
Wins and Losses of HMMs in Gesture
Good points about HMMs:
– A learning paradigm that acquires spatial and temporal models and does some amount of feature selection.
– Recognition is fast; training is not so fast but not too bad.
Not-so-good points:
– If you know something about state definitions, it is difficult to incorporate (coming later…)
– Every gesture is a new class, independent of anything else you've learned.
– → Particularly bad for "parameterized gesture."
Parameterized Gesture
“I caught a fish this big.”
Parametric HMMs (PAMI, 1999)
Basic ideas:
– Make the output probabilities of the state be a function of the parameter of interest: b_j(x) becomes b'_j(x, θ).
– Maintain the same temporal properties; a_ij unchanged.
– Train with known parameter values to solve for the dependencies of b' on θ.
– During testing, use EM to find the θ that gives the highest probability. That probability is the confidence in recognition; the best θ is the recovered parameter.
Issues:
– How to represent dependence on θ?
– How to train given θ?
– How to test for θ?
– What are the limitations on dependence on θ?
Linear PHMM - Representation
Represent dependence on θ as linear movement of the mean of the Gaussians of the states:
μ̂_j(θ) = W_j θ + μ̄_j
Need to learn W_j and μ̄_j for each state j. (ICCV '98)
(For the graphical model folks in the audience.)
Linear PHMM - training
Need to derive the EM update equations for the linear parameters W_j and μ̄_j, then proceed as normal.
Linear PHMM - testing
Derive the EM equations with respect to θ:
We are testing by EM! (i.e. iterative):
– Solve for the state occupancies γ_tk given a guess for θ
– Solve for θ given a guess for γ_tk
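The alternation can be illustrated with a deliberately stripped-down toy: scalar θ, 1-D Gaussian states, and no transition structure, so the "E-step" is just a per-observation state responsibility and the "M-step" a closed-form weighted least-squares solve for θ. This is a simplification for intuition, not the PHMM equations themselves:

```python
import numpy as np

def estimate_theta(x, W, mu_bar, sigma=1.0, n_iter=20):
    """Toy PHMM-style testing: alternate between state responsibilities
    (gamma) and a closed-form solve for theta.
    State means are mu_j(theta) = W[j] * theta + mu_bar[j]."""
    theta = 0.0
    for _ in range(n_iter):
        # "E-step": responsibility of each state for each observation,
        # under Gaussian outputs with the current theta.
        means = W[None, :] * theta + mu_bar[None, :]      # shape (T, J)
        log_lik = -0.5 * ((x[:, None] - means) / sigma) ** 2
        gamma = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # "M-step": weighted least squares for theta.
        num = (gamma * W[None, :] * (x[:, None] - mu_bar[None, :])).sum()
        den = (gamma * W[None, :] ** 2).sum()
        theta = num / den
    return theta
```

Run on data generated with a known θ, the loop converges to it in a few iterations; the final likelihood plays the role of the recognition confidence.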
How big was the fish?
Pointing
Pointing is the prototypical example of a parameterized gesture.
– Assuming two DOF, one can parameterize either by (x, y) or by a pair of angles.
– Under the linear assumption, the parameterization must be chosen carefully.
– A generalized non-linear map would allow greater freedom. (ICCV '99)
Linear pointing results
Test for both recognition and recovery:
If we prune based on legal values of θ (MAP via a uniform density):
Noise sensitivity
Compare ad hoc procedure with PHMM parameter recovery (ignoring “their” recognition problem!!).
Lesson from PHMMs:
It’s the representation, stupid…
(The non-linear case is an even better representation.)
Some Q's about Representations…
Scope and Range:
– Densely sampled motions through parameter spaces
"Language" of the representation:
– Stochastic FSM through regions of parameter space
Computability of an instance:
– Only assumes a consistent noise model between training and testing
Learnability of the "class":
– Explicit training; easily acquired
– All parameters learned
Stability in face of perceptual uncertainty:
– Pretty good (see above)
Inference-support:
– Zilch