Encoding, Recognizing, Retrieving and Predicting Complex Human Actions
(Computer Assisted Perception and Action Systems)
Terry Caelli Computing Science
University of Alberta, Canada
Some current application areas using active and passive sensing
Agriculture: Cattle Feeding
Image Understanding: Forestry Inventory Systems
Understanding Human Navigation
Map revision
Computer-based Skill Acquisition & Skill Transfer
General Aim of CAPA
To integrate sensing, multimedia, animation and machine learning to help humans
perform practical tasks
Problems:
• How do we sense and encode human actions?
• How do we train machines to learn human actions?
• How do we recognize and predict human actions?
• How do we transfer this training from machines to humans?
Sensors
• Active: magnetic field, acoustic, laser
• Passive: cameras
Computer-based Skill Acquisition & Skill Transfer
• Gesture recognition
MIT Media Labs, e.g. Bobick, Brand, Pentland (1995-98)
Reviews: Kohler and Schroter (1998), Aggarwal and Cai (1999)
• Skill recognition
Skill: Kuniyoshi et al. (1993, 94)
Construction: Fritsch et al. (2000-2002)
Mann et al. (1996)
• Sensors, encoders
- Video: 2D region/feature extraction (e.g. eigenvalues of ellipses, colour)
- Inference of 3D from 2D features, indexing 3D CAD models
Past Work: mainly video-based
Three Current Stochastic Control Models
• Kalman Filters
Single Gaussian predictor-corrector model
• Hidden Markov Models
Any pdfs apply, but requires careful use of EM and Viterbi
• Particle Filters and Markov Chain Monte Carlo methods
Use importance sampling approaches; generalize HMMs; issues of priors, proposal and sampling distributions
WHAT IS LEFT TO DO?
• Apply to tasks that really have the potential to assist humans
• Improve sensors
• Improve encoding models
• Improve learning/estimation, recognition and prediction methods
• Develop more objective methods for evaluating models
THE ENCODING PROBLEM: Encoding Human Kinematics
Each sensor or image feature records position, velocity and inferred forces, angles between joints, etc
Must be fully 3D or inferred 3D from sensed 2D data
For active sensors the basic signal is a contour trajectory
X_i(t) = (x_i(t), y_i(t), z_i(t))

Encoder options:
(1) Transform sets of sensor trajectories into inter-sensor angles, etc.
(2) Encode each sensor trajectory and their correlations <= used here
SHAPE: unique and invariant. The Serret-Frenet equations
Being INTRINSIC properties they are NOT localized in space or in absolute orientation
Curvature κ: how the tangent (T) changes with respect to ds
Torsion τ: how the binormal (B) changes relative to the normal (N) with respect to ds: "amount of screw"

dT = κ N ds
dN = -κ T ds + τ B ds
dB = -τ N ds
Encoding TRAJECTORY SHAPE
Curvature and torsion can be directly computed by
κ(t) = |C′(t) × C″(t)| / |C′(t)|³ = sqrt(E(t)² + F(t)² + G(t)²) / |C′(t)|³

τ(t) = det[C′(t), C″(t), C‴(t)] / (E(t)² + F(t)² + G(t)²)

where
E(t) = ẏ(t)z̈(t) - ż(t)ÿ(t)
F(t) = ż(t)ẍ(t) - ẋ(t)z̈(t)
G(t) = ẋ(t)ÿ(t) - ẏ(t)ẍ(t)
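As a sketch of how these formulas behave on sampled data (not the author's code; `curvature_torsion` is an illustrative name), curvature and torsion can be evaluated with finite-difference derivatives:

```python
import numpy as np

def curvature_torsion(X, t):
    """Curvature and torsion of a sampled 3D trajectory X (N x 3) at times t."""
    # numerical first, second, third derivatives along the time axis
    d1 = np.gradient(X, t, axis=0)
    d2 = np.gradient(d1, t, axis=0)
    d3 = np.gradient(d2, t, axis=0)
    cross = np.cross(d1, d2)                          # C' x C''
    num = np.linalg.norm(cross, axis=1)               # |C' x C''|
    speed = np.linalg.norm(d1, axis=1)                # |C'|
    kappa = num / speed**3
    tau = np.einsum('ij,ij->i', cross, d3) / num**2   # det[C',C'',C'''] / |C' x C''|^2
    return kappa, tau

# sanity check on a circle of radius 2: kappa = 1/2, tau = 0
t = np.linspace(0, 2 * np.pi, 2000)
X = np.stack([2 * np.cos(t), 2 * np.sin(t), np.zeros_like(t)], axis=1)
k, tau = curvature_torsion(X, t)
print(k[1000], tau[1000])   # approximately 0.5 and 0.0 away from the endpoints
```

Accuracy near the window endpoints is worse because `np.gradient` falls back to one-sided differences there.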
Computing Invariant features
Past filters: Mokhtarian (1997), Gaussians: NOT optimal

One optimal solution: Savitzky-Golay (SG) filters
A linear least squares filter:
• Choose window (causal, non-causal)
• Choose order of polynomial
• Derive filter kernels to best fit data to polynomial

Solution for coefficients: a = (AᵀA)⁻¹Aᵀf, giving a moving-window filter form (AᵀA)⁻¹Aᵀeₙ
A is the "design" matrix: data x polynomial bases

Known properties:
• Robust to noise
• Fits data using higher-order moments
• Determines derivatives analytically

MOST IMPORTANT: 2 scales, window size and order of polynomial (moments), for solving the linear least squares problem LOCALLY IN EACH MOVING WINDOW
min over a of  Σ_{i=1..N} [ y_i - Σ_{k=1..M} a_k X_k(t_i) ]²,   with A_ij = X_j(t_i)
Example: coefficients for window size +/- 4:
{0.04, -0.13, 0.07, 0.32, 0.42, 0.32, 0.07, -0.13, 0.04}
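A minimal sketch of the kernel construction from the design matrix, per a = (AᵀA)⁻¹Aᵀf (illustrative code, not the author's). With a window of +/-4 and an order-4 polynomial, the resulting smoothing weights closely match the coefficients listed above (order 4 is inferred from those values):

```python
import math
import numpy as np

def savgol_kernel(half_window, order, deriv=0):
    """Savitzky-Golay kernel via the least-squares solution a = (A^T A)^{-1} A^T f:
    row `deriv` of the pseudo-inverse gives the moving-window weights for the
    deriv-th derivative at the window centre (times deriv!)."""
    x = np.arange(-half_window, half_window + 1)
    A = np.vander(x, order + 1, increasing=True)    # design matrix: powers of x
    pinv = np.linalg.inv(A.T @ A) @ A.T             # (A^T A)^{-1} A^T
    return math.factorial(deriv) * pinv[deriv]

# window +/-4 with an order-4 polynomial: smoothing weights
w = savgol_kernel(4, 4)
print(np.round(w, 2))   # centre weight rounds to 0.42
```

A useful check of correctness: applying the kernel to samples of any order-4 polynomial must return that polynomial's value at the window centre exactly.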
3D velocity and acceleration with respect to intrinsic arc length parameter u
V(t) = dX(s)/ds · ds/dt
A(t) = dV(s)/ds · ds/dt
Encoding TRAJECTORY DYNAMICS
[Figure: a 3D recorded action with its 2D curvature-torsion space and 2D V-A space representations]
Complete Invariant signature
1) SHAPE: κ(u) and τ(u) for each point
2) DYNAMICS: V(u), A(u) for each point (total, NOT directional)
3) The initial position and direction difference vectors
Forward Model: Dynamics-to-Shape
Inverse Model: Shape-to-Dynamics
..can be multi-scaled with a symbolic representation
The Point Screw Decomposition Model
Essential motion parsing idea
The helix X(t) = (a cos(t), a sin(t), bt) is uniquely defined by κ = a/(a² + b²) and τ = b/(a² + b²)
so, if we cluster/quantize {κ, τ} signatures we are generating a screw approximation to a curve
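The helix identities can be checked directly with analytic derivatives; a = 2, b = 1 are illustrative values:

```python
import numpy as np

# Helix X(t) = (a cos t, a sin t, b t): kappa and tau are constant along the curve
a, b = 2.0, 1.0
t = 0.7                                            # any t gives the same result
d1 = np.array([-a * np.sin(t),  a * np.cos(t), b])    # X'
d2 = np.array([-a * np.cos(t), -a * np.sin(t), 0.0])  # X''
d3 = np.array([ a * np.sin(t), -a * np.cos(t), 0.0])  # X'''

cross = np.cross(d1, d2)
kappa = np.linalg.norm(cross) / np.linalg.norm(d1)**3
tau = (cross @ d3) / (cross @ cross)

print(kappa, a / (a**2 + b**2))   # both 0.4
print(tau,   b / (a**2 + b**2))   # both 0.2
```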
Solutions to learning, recognition and prediction of screw sequences
Markov shape theory: stochastic differential geometry
How do we train machines to learn human actions?
How do we recognize and predict human actions?
dT = κ N ds
dN = -κ T ds + τ B ds
dB = -τ N ds
Sensory-motor programs for complex actions involve storing simple dynamical stochastic rules about screw action sequences, including what should be sensed (observed) during their execution
Key idea
(S(t+1), σ(t+1)) = A(t, t+1)((S(t), σ(t)), o(t+1))

An adaptive, generalized version of the Serret-Frenet equations
Current Solutions
- Cartesian Product (complete model): O(TN^{2C})
- Structured Mean Field: Markov random field, compute over cliques (Ghahramani and Jordan 1996): O(TN^{2C*})
- N-heads (Brand, 1997): compute "most likely": O(T(CN)²)
- Weighted Marginals Model (Caelli et al, 2001; Zhong and Ghosh, 2001): O(T(CN))

The Problem: the causal model component

P(S_t^c / S_{t-1}^1, S_{t-1}^2, ..., S_{t-1}^C)

is exponential!
π = { p(S_u^i(1)); i = 1..N; u = 1..N_i }

λ = { π, A, B, C }

for N HMMs with N_i states (actions) for each HMM: π is the prior probability of each state of each HMM;
C corresponds to the coefficients of interaction, A to the intra- and inter-state transition matrices,
B to the state-dependent observation probability matrices.
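The parameter set λ = {π, A, B, C} above can be held in a simple container; the sketch below is illustrative only (names, sizes and the coupling weights are made-up examples, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols, rng):
    """A random row-stochastic matrix (each row sums to 1)."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

N, Ni, K = 2, 3, 4          # 2 HMMs, 3 states each, 4 observation symbols
model = {
    "pi": [np.full(Ni, 1.0 / Ni) for _ in range(N)],          # state priors per HMM
    "A":  {(i, j): random_stochastic(Ni, Ni, rng)             # intra (i == j) and
           for i in range(N) for j in range(N)},              # inter (i != j) transitions
    "B":  [random_stochastic(Ni, K, rng) for _ in range(N)],  # observation probabilities
    "C":  np.array([[0.9, 0.1], [0.1, 0.9]]),                 # coupling weights w_ij
}
print(model["A"][(0, 1)].shape)
```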
A = { a_uv^{ij} = p(S_v^j(t+1) / S_u^i(t)) }   (intra: i = j; inter: i ≠ j)

B = { b_u^i(o_k) = p(o_k / S_u^i) }   where o_k is the k-th observation symbol
WMM: Coupled hidden Markov model
WMM: Generalized Viterbi
Initialization (for 1 ≤ i ≤ N, 1 ≤ u ≤ N_i):

δ_u^i(1) = π_u^i b_u^i(o_1^i)

Recursion (for t = 2..T, 1 ≤ j ≤ N, 1 ≤ v ≤ N_j):

δ_v^i(t) = max_u [ w_ii a_uv^{ii} δ_u^i(t-1) + Σ_{j≠i} w_ij max_w a_wv^{ji} δ_w^j(t-1) ] b_v^i(o_t^i)
ψ_v^i(t) = argmax_u [ w_ii a_uv^{ii} δ_u^i(t-1) + Σ_{j≠i} w_ij max_w a_wv^{ji} δ_w^j(t-1) ]

Back tracking:

q^i(T) = argmax_u δ_u^i(T)
q^i(t) = ψ^i_{q^i(t+1)}(t+1),  t = T-1, ..., 1
Produces most likely sequences of states given the observations and CHMM model
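For reference, the uncoupled core of this recursion is the standard single-HMM Viterbi algorithm; the sketch below shows only that case in log space (the WMM version above additionally weights in the other chains through w_ij, which is omitted here):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Standard Viterbi for one HMM in log space: delta recursion plus
    psi backtracking, returning the most likely state sequence."""
    T, Nst = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]            # initialization
    psi = np.zeros((T, Nst), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA              # scores[u, v]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]                    # backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# toy 2-state model (illustrative numbers)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))   # [0, 0, 1, 1]
```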
N-Heads
Forward operator
In matrix form we have
α_{t+1} = CA α_t b(t+1)
The corresponding backward operator is
β_t = C′A′ β_{t+1} b′(t)
[Figure: forward and backward recursions over t = 1..T for coupled HMM_i and HMM_j, with observation sequences o_1^i..o_T^i and o_1^j..o_T^j]
Produces an update of the CHMM model given the set of observations
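As a concrete illustration of the matrix-form forward operator, the sketch below runs the recursion for a single chain (the coupling matrix C is dropped for clarity, and the model numbers are made up); summing the final α gives the observation-sequence likelihood:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Matrix-form forward recursion for one HMM:
    alpha_{t+1} = b(o_{t+1}) * (A^T alpha_t), elementwise in the state index."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = B[:, o] * (A.T @ alpha)
    return alpha            # alpha.sum() = P(observation sequence | model)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1]).sum())   # 0.209
```

The same value falls out of brute-force enumeration over all state paths, which is a handy unit test for any forward implementation.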
WMM Estimation: Generalized Baum Welch
Initialization:

α_u^i(1) = p(S_u^i) p(o_1^i / S_u^i)

Induction:

α_u^i(t) = [ Σ_v w_ii a_vu^{ii} α_v^i(t-1) + Σ_{j≠i} Σ_v w_ij a_vu^{ij} α_v^j(t-1) ] b_u^i(o_t^i)
WMM
N-Heads
Examples of tasks: "Drain", "Spider"

3 participants, 4 tasks
assembly, disassembly for "drain", "spider"; approximately 25 seconds duration each
repeated 10 times each
Observation sequences 230-260 in length
[Figure: Constrained Monte Carlo Hamming Distance as a function of # of Observations and # of States]
Task Recognition and Discrimination P(Hits)-P(False Alarms)
Deriving the best number of observation symbols and hidden states from task samples
Use of 16 observation symbols and 16 states
Recognizing actions
3 (participants) x 4 (tasks) => 12 Coupled HMM models
5 unseen new samples/participant/task => 60 tests
In all cases
log P(Viterbi / correct training data from 4 sensors) >
log P(Viterbi / incorrect test data from 4 sensors)
100% correct identification on each and every component HMM: this simply implies that subject performance and the tasks were quite different!
Predicting actions: Viterbi solution using Monte Carlo sampling and probability correct (PC) measure

Coupling coefficient within an arm (C):

                     0.0              0.1              0.6
               Training  Test   Training  Test   Training  Test
Task 1   AM      0.66    0.60     0.72    0.68     0.67    0.63
         JF      0.67    0.63     0.75    0.70     0.68    0.66
         TC      0.67    0.65     0.75    0.73     0.73    0.72
Task 2   AM      0.63    0.62     0.71    0.71     0.71    0.70
         JF      0.65    0.65     0.73    0.72     0.72    0.72
         TC      0.67    0.65     0.73    0.71     0.70    0.68
Task 3   AM      0.66    0.65     0.75    0.75     0.71    0.70
         JF      0.65    0.68     0.73    0.74     0.70    0.68
         TC      0.68    0.69     0.74    0.74     0.75    0.75
Task 4   AM      0.62    0.61     0.68    0.70     0.69    0.69
         JF      0.62    0.62     0.65    0.64     0.65    0.66
         TC      0.62    0.58     0.70    0.70     0.67    0.68
Means:           0.65    0.60     0.67    0.71     0.65    0.69
• Long sequences – regimes problem?
• Wrong Model – CHMM too limited?
• Too much uncertainty?
• How to tell?
[Figure: typical model uncertainty in a coupled HMM with transition matrices A1, A2, coupling matrices A12, A21 and observation matrices B1, B2]
Why did the model correctly recognize but not predict as well as we wanted?
We need measures of model parameters with respect to recognition and prediction, beyond MAP
Condition Number Residual analysis
The entries a_ij of A and b_jk of B play a key role in estimation and prediction.
So, consider the augmented matrix [A | B]
• Rows define state and observation attributes of each state
• Dependent rows indicate redundant states
• Compute inverse condition number: singular values
• Use residuals to delete, merge or split states/observations
Example: inverse condition number σ_min / σ_max

A = | 0.8  0.2 |    B = | 0.5  0.5 |
    | 0.8  0.2 |        | 0.5  0.5 |

The model (NB: state priors from the first eigenvector of A)
...may represent the data but has 0 inverse condition number!
Conditional Entropy (Information Content)
Given a model and observation sequence we can compute:
(1) The Viterbi optimal state sequence given the complete model
(2) The optimal state sequence given the B matrix and priors:
p(S / O_t) = max_S { p(O_t / S) p(S) }   (Bayesian MAP Classifier)
(3) Compute the conditional entropy from the two sets of state sequences
H(V / B) = H(V, B) - H(B)
R(V / B) = H(V) - H(V / B)
BHBVHBVH
R(V / B): the amount of information in the Viterbi solution explained by the Bayesian classifier
H(V / B): the amount of information in the Viterbi solution not explained by the Bayesian classifier: the pure Markov component
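These quantities can be computed directly from the two state sequences; the sketch below uses empirical (plug-in) entropies and toy sequences, purely for illustration:

```python
from collections import Counter
import numpy as np

def entropy(seq):
    """Empirical entropy (bits) of a sequence of symbols or symbol pairs."""
    counts = np.array(list(Counter(seq).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def markov_component(viterbi_states, bayes_states):
    """H(V/B) = H(V,B) - H(B): information in the Viterbi sequence NOT
    explained by the B-matrix-only (Bayesian) classifier; R(V/B) is the
    explained part."""
    HV = entropy(viterbi_states)
    HVB = entropy(list(zip(viterbi_states, bayes_states))) - entropy(bayes_states)
    return HV, HVB, HV - HVB          # H(V), H(V/B), R(V/B)

# toy sequences (illustrative only)
v = [0, 1, 0, 1, 0, 1, 0, 1]          # Viterbi-optimal states
b = [0, 1, 0, 1, 1, 0, 0, 1]          # MAP-classifier states
HV, HVB, R = markov_component(v, b)
print(round(HV, 2), round(HVB, 2), round(R, 2))
```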
Example – Recovery of 3D Hand movements from images
The HMM model parameters are almost never published, so we don't know whether the appropriate model is really an HMM or simply a Markov chain or a Bayesian classifier
Markov Chain:
A = | 0.8  0.2 |    B = | 0.5  0.5 |
    | 0.3  0.7 |        | 0.5  0.5 |

Bayesian Classifier:
A = | 0.5  0.5 |    B = | 0.7  0.3 |
    | 0.5  0.5 |        | 0.2  0.8 |

So, for hand motion recognition/tracking:
• NO HMM would use the Markov property with random movements
• NO HMM would use the B matrix if features were ambiguous
Experiments
Deterministic Walk
Random Walk
Random Pose Poses:
5 pitch {-30,0,20,50,80}
5 roll {-90,-45,0,45,90}
4 Yaw {-20,-10,0,10}
100 possible poses
A = | 0.46 0.54 0.00 0.00 0.00 |    B = | 0.74 0.26 0.00 0.00 0.00 |
    | 0.45 0.00 0.55 0.00 0.00 |        | 0.90 0.10 0.00 0.00 0.00 |
    | 0.00 0.53 0.00 0.47 0.00 |        | 0.67 0.33 0.00 0.00 0.00 |
    | 0.00 0.00 0.46 0.00 0.54 |        | 0.89 0.05 0.04 0.03 0.00 |
    | 0.00 0.00 0.00 0.55 0.45 |        | 0.86 0.14 0.00 0.00 0.00 |

Residuals = [ 0.4 0.4 0.5 0.4 0.4 | 0.1 0.1 0.0 0.0 0.0 ]
               (states)              (observation symbols)

σ_min / σ_max = 0.13
After refinement: H(V) = 2.32, H(V/B) = 1.59, R(V/B) = 0.73
i.e. 69% (= 1.59/2.32) of the information in the optimal state sequence comes from the A matrix.
MAP as a function of model uncertainty
Initial model: Random
Initial Model: Deterministic
Shows how MAP breaks down as model becomes more random
Conclusions
(1) There are many uses for computer assisted perception and action systems
(2) Prototyping human actions can be used for teaching, assessing and transferring human skill, wellness, etc.
(3) Issues of sensors, encoders are still open for development: robustness is an issue
(4) Models are also open: Kalman filters, HMMs, ARMA, particle filters are all useful, BUT a complete understanding and assessment of how model parameters function is critical for design