10
Computer Science and Artificial Intelligence Laboratory Conditional Random People: Tracking Humans with CRFs and Grid Filters Leonid Taycher, Gregory Shakhnarovich, David Demirdjian, Trevor Darrell Technical Report massachusetts institute of technology, cambridge, ma 02139 usa — www.csail.mit.edu December 1, 2005 MIT-CSAIL-TR-2005-079 AIM-2005-034

Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

Computer Science and Artificial Intelligence Laboratory

Conditional Random People: Tracking Humanswith CRFs and Grid FiltersLeonid Taycher, Gregory Shakhnarovich, David Demirdjian, Trevor Darrell

Technical Report

m a s s a c h u s e t t s i n s t i t u t e o f t e c h n o l o g y, c a m b r i d g e , m a 0 213 9 u s a — w w w. c s a i l . m i t . e d u

December 1, 2005MIT-CSAIL-TR-2005-079AIM-2005-034

Page 2: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

Conditional Random People: Tracking Humans with CRFsand Grid Filters

Leonid Taycher†, Gregory Shakhnarovich‡, David Demirdjian†, and Trevor Darrell†

† Computer Science and Artificial Intelligence Laboratory,Massachusetts Institute of Technology,

Cambridge, MA 02139

‡ Department of Computer Science,Brown University,

Providence, RI 02912{lodrion, demirdji, trevor}@csail.mit.edu [email protected]

AbstractWe describe a state-space tracking approach based on aConditional Random Field (CRF) model, where the obser-vation potentials are learned from data. We find functionsthat embed both state and observation into a space wheresimilarity corresponds to L1 distance, and define an obser-vation potential based on distance in this space. This po-tential is extremely fast to compute and in conjunction witha grid-filtering framework can be used to reduce a contin-uous state estimation problem to a discrete one. We showhow a state temporal prior in the grid-filter can be com-puted in a manner similar to a sparse HMM, resulting inreal-time system performance. The resulting system is usedfor human pose tracking in video sequences.

1 IntroductionTracking articulated objects (such as humans) is an exam-ple of state estimation in a high-dimensional space with anon-linear observation model that has been a focus of con-siderable research attention. The combination of frequentself-occlusion and unobservable degrees of freedom withthe large volume of the pose space make probabilistic meth-ods appealing. The vast majority of probabilistic articulatedtracking methods are based on a generative model formula-tion.

Current state-of-the-art generative tracking algorithmsuse non-parametric density estimators, such as particle fil-ters, due to their ability to model arbitrary multimodal dis-tributions [18, 10]. Unfortunately, several properties con-spire to make particle filtering extremely computationallyintensive. On one hand, a large number of particles isneeded in order to faithfully model the distributions in ques-tion. On the other hand, a complex likelihood functionneeds to be evaluated for each particle at every iterationof the algorithm. A further drawback of generative-model

based algorithms is that the likelihood function is too com-plicated to be learned from data and is usually specified inan ad-hoc fashion. Recently, the use of directed discrimina-tive models with parameters learned directly from data havebeen proposed [1, 27, 22].

In this work we pose state estimation as inference in anundirected Conditional Random Field model (CRF) [17].This allows us to replace the likelihood function with amore general observation potential (compatibility) functionthat can be automatically learned from training data. Thesefunctions might be expensive to evaluate in general, but canbe made efficient at run-time if all state (pose) values atwhich they can be evaluated are known in advance. In thiscase much of the computation can be performed off-line,thus greatly reducing run-time complexity.

This algorithm naturally operates on a discrete set ofsamples, and we will show how we can estimate the pos-terior probability in a continuous state space using grid fil-tering methods. The idea underlying these methods is that ifthe state-space can be partitioned into regions that are smallthen the posterior can be well approximated by a constantfunction within each region.

The direct application of grid filtering would, of course,result in the need to evaluate the potential function in eachregion in the partition, which is impossible to do in reason-able time even with fast implementation. Fortunately this isnot necessary, since at a particular time-step, the prior stateprobability would be negligible in the vast majority of theregions, allowing us to concentrate only on locations with asignificant prior.

Our algorithm operates in a standard predict-updateframework: at every step we first estimate the temporalprior probability of the state being in each of the regionsin the partition. We then evaluate the observation potentialonly for regions with non-negligible prior. When the set ofcells is fixed, we can precompute the transition probabili-ties between cells, and thus reduce the temporal prior com-

1

Page 3: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

putation to a single sparse matrix/vector multiplication, ina manner similar to HMMs [20], thus avoiding a samplingstep altogether.

After reviewing related prior work, we first describe theCRF-based tracking formulation and describe a way to learna particular observation potential function based on imageembedding (Section 3). We then discuss a grid-filter-basedinference method which can be realized with a sparse HMMcomputation (Section 4). The results of our method aredemonstrated and compared against competing algorithmsin Section 5.

2 Prior WorkProbabilistic articulated pose estimation is often ap-proached using state-space methods. The majority of theapproaches have been based on a generative model formu-lation, with varying assumptions about the forms of thepose distribution and transition probabilities. Early meth-ods [14, 21] assumed that both were Gaussian and usedKalman filtering. Extended and Unscented [28] Kalmanfilters enabled modeling of non-linear transitions, but stillconstrained pose distribution to be Gaussian. These meth-ods required a relatively small number of evaluations of thelikelihood function, but lost track due to restrictive distribu-tion models.

The need to relax the unimodality assumption led firstto use of mixture models [11, 5], and then to Monte-Carlomethods that represent distributions with sets of discretesamples (particles) [18, 10, 25, 26]. While theoreticallysound, particle filtering methods are not very successful inhigh dimensions [15] – they require large numbers of par-ticles to faithfully represent the distribution, which entailslarge computational costs of likelihood evaluation. Further-more, the emission probability model used in likelihoodevaluation is very expensive to train, and is often hand-designed in an ad-hoc fashion.

Several discriminative methods have been proposed forvisual pose tracking. These algorithms apply various re-gression techniques while leveraging large number of anno-tated image sequences. For example, one [1], or a mixture[27] of simple experts were trained to predict current posebased on the past pose estimates and the current observa-tion. Robust regression combined with fast nearest neighborsearch was used for single frame pose estimation in [23].

In this paper we dispense with directed models alto-gether and opt for a Conditional Random Field (CRF) [17]model. The main advantage of this model over generativemodels is that CRFs do not require specification (and eval-uation) of the emission probability, but only similarity be-tween state and observation(s). CRFs are also a more flexi-ble model than the previously proposed regression methods.They allow for modeling the relationship between the state

and an arbitrary subset of observations. They are also bet-ter able to adjust to sequences not appearing in the trainingdata. For example the MEMM model (similar to one usedin [27]) has been shown to be subject to label bias problem[17].

While in the present work we use a simple chain-structured CRF (Figure 1(b)), which directly models the de-pendency between concurrent state and observation, it canbe extended by introducing more general relationships be-tween state and observations.

We learn the observation potential function for ourmodel using the parameter sensitive embedding introducedin [23]. This algorithm allows us to learn a transformationof images of humans into a space where the distance be-tween embeddings of two images is likely to be small ifthe poses are similar and large otherwise. The observationpotential of a particular pose is then determined by the dis-tance between embeddings of the rendering of the figure inthis pose and the observed image.

If for every pose at which we would like to evaluate thepotential we had to render the corresponding image, ourmethod would be extremely slow. By discretizing the con-tinuous pose space, we are able to precompute the embed-dings of all discrete poses off-line, thus drastically reducingrun-time complexity. Fixing the set of poses at which ob-servation potential can be computed would seem to be anunreasonable restriction, since we are operating in a contin-uous pose space, but we overcome this problem by using avariant of the grid-filtering technique [6, 2].

The main idea underlying grid filtering is that sufficientlydiscretized random variable is indistinguishable from a con-tinuous one. That is, if the distribution can approximated bya piece-wise constant function, then it is sufficient to evalu-ate it only at one point in every “constancy region” (cell) [6].This reduces a continuous estimation problem to a discreteone (albeit with very large number of discrete points). Weshow that in the case where both observation potential andthe temporal prior are constant in each cell, tracking can beformulated as state estimation in an HMM framework, al-lowing us to use existing inference algorithms; further, wehave found in practice that a manageable number of cellssuffices for realistic tracking tasks.

3 Tracking with Conditional Ran-dom Fields

Figure 1(a) shows the dynamic generative model that iscommonly used in tracking applications. The state (pose1)at time t is denoted as θt, and the observed images as It.The full model is specified by the initial distribution p(θ0),

1In this work we consider only first order Markov models of motion.

2

Page 4: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

θt+1θtθt−1

I t+1I tI t−1 I t+1I t−1 I t

θt+1θt−1 θt θt+1

I t+1I t−1 I t

θt−1 θt

(a) (b) (c)

Figure 1: Chain-structured generative (a), CRF (b), and MEMM (c) tracking models. In all models the state of the object(pose) at time t is specified by θt, and the observed image by It. The generative model is described by transition probabilityp(θt|θt−1) and the emission probability p(It|θt). The CRF model is described by motion compatibility (potential) functionφ(θt, θt−1) and the image compatibility function φt

o(θt) = φ(It, θt). Note the contrast with the MEMM model [27], specified

by the conditional distribution p(θt|θt−1, It) as shown in (c).

the transition probability model p(θt|θt−1), and the emis-sion distribution p(It|θt). This model describes the jointprobability of the state(s) and observation(s)

p(θ0..T , I1..T ) = p(θ0)T∏

t=1

[p(θt|θt−1)p(It|θt)],

from which appropriate conditional distributions of the poseparameters can be derived.

While reasonable approximations can be constructedfor the transition probability, p(θt|θt−1), the problem forgenerative models lies in specifying the emission modelp(It|θt). In practice, to evaluate the likelihood function ata particular pose, a figure in this pose is first rendered asan image, and this image is then compared with the obser-vation using a certain metric[13]. Evaluating the likelihoodthus becomes computationally expensive.

The major difference between generative-model basedapproaches and ours is that we formulate pose estimationas inference in a Conditional Random Field (CRF) model,and are able to learn a compact and efficient observationand transition potentials from data.

A chain version of a CRF is shown in Figure 1(b).While, apart from the lack of arrows, it is quite similarto the generative model, the underlying computations arequite different. This model is specified by the motion po-tential φ(θt, θt−1) and the observation potential φt

o(θt) =

φ(It, θt). The observation potential function is the mea-sure of compatibility between the latent state and the obser-vation. Of course, one choice for it might be the genera-tive model’s emission probability p(It|θt), but this does nothave to be the case. It can be modeled by any function that islarge when the latent pose corresponds to the one observedin the image and small otherwise.

Rather than modeling the joint distribution of poses andobservations, the CRF directly models the distribution ofposes conditioned on observation,

p(θ0..T |I1..T ) =1Z

p(θ0)T∏

t=1

[φ(θt, θt−1)φto(θ

t)],

where Z is a normalization constant.

Once the observation potential is defined, a chain-structured CRF2 can be used to perform on-line tracking

p(θt|I1..T ) ∝ φto(θ

t)

Zφ(θt, θt−1)p(θt−1|I1..t−1)dθt−1. (1)

The main advantage of this model from our standpoint isthat the observation potential φt

o(θt) may be significantly

simpler to learn and faster to evaluate than the emissionprobability p(It|θt). Below we describe an model of suchpotential based on similarity between images.

Suppose that we can measure the similarity S such that,given two images Ia and Ib with underlying poses θa andθb, respectively, S(Ia, Ib) is with high probability small ifdθ(θa, θb) is small, and large otherwise.3 Suppose now thatwe are interested in evaluating the potential φ(It, θ), andthat we have access to an image Iθ that corresponds to thepose θ (for instance, we can render it using computer graph-ics). Then, we can define the observation potential based ondistance in the image embedding space:

φ(It, θ) = N(S(It, Iθ); 0, σ2). (2)

In this work, we follow the approach in [23] for learn-ing a binary embedding H(I) of images such that the L1

distance in the H space serves as a proxy for such a pose-sensitive similarity S. Briefly, the learning algorithm isbased on formulating a classification problem on imagepairs (similar/dissimilar), and constructing an embeddingbased on a labeled training set of such pairs.

Once the desired M -dimensional embedding H =[h1(I), . . . , hM (I)] has been learned, the induced sim-ilarity is the Hamming distance in H: S(Ia, Ib) =∑M

m=1 |hm(Ia)− hm(Ib)|.This potential could conceivably be used in a continuous

domain, for example by using Monte Carlo methods in theCRF framework, as it captures features relevant to pose esti-mation better than generic image similarity. Unfortunatelyit would not reduce computational cost since it would re-quire rendering the image Iθ at runtime for every pose θwhich we would like to evaluate.

2While both transition and observation potentials in a CRF are oftentrained jointly, it is possible to train them separately, as we do in this case.

3Here dθ stands for the appropriate distance in pose space

3

Page 5: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

This approach becomes particularly efficient when wehave a finite (albeit large) set of possible pose hypothesesθ1, . . . , θN . In such a case we can render an image Ii foreach pose in the set, and compute its embedding H(Ii).The only calculation required at runtime is computing theembedding H(It) and calculating the Hamming distancesbetween the bit vectors. We capitalize on this efficiency inthe grid-filtering framework described in the next section.

4 Grid FilteringIn the previous section we have proposed a CRF trackingframework where the observation potential is computed asthe distance between embeddings of state and observationdescribed in the previous section. Computing this potentialfor an arbitrary pose and image is relatively slow since itwould involve rendering an image of a person in this poseand then computing the embedding. This is part of theproblem with generative-model-based tracking which wewanted to avoid.

Fortunately, if all of the poses where the observation po-tential is to be evaluated are known in advance, then we canprecompute the appropriate embedding off-line, drasticallyreducing runtime evaluation cost. We would then computea single embedding for the observed image, which would beamortized when the potential is evaluated at multiple poses.

While fixing the poses in advance seems too restrictivefor continuous space inference, grid-based techniques pio-neered by [4, 16] show that this can be a profitable approx-imation. The main idea underlying these methods is thatmany functions of interest can be approximated by piece-wise constant functions, if the region of support for eachconstant “piece” is small enough. As metioned above, wefollow the convention and denote such region of support asa “cell”.

In our case, the function we are interested in is the pos-terior probability of the pose conditioned on all previouslyseen observations (including the current one). The poste-rior is proportional to the product of the temporal prior (thepose probability based on the estimate at the previous time-step and the motion model) and the observation potential.We would like to define the cells such that both of the com-ponents are almost constant. The observation potential isoften sharply peaked, so the cells should be small in the re-gions of pose space where we expect large appearance vari-ations, but large in other regions. On the other hand themotion models are usually (and our work is no exception)very approximate and compensate for it by inflated dynamicnoise. Thus the temporal prior is broad and should also beapproximately constant on cells small enough for observa-tion potential constancy. We derive the grid filter based onthe assumption that the partition of the pose space into cellswith the properties described above is available.

Let the space of all valid poses Θ be split into N dis-joint (and not necessarily regular) cells Ci, Θ = ∪N

i=1Ci,Ci ∩ Cj = ∅, i 6= j, such that both likelihood and prior canbe approximated as constant within each cell. Furthermore,let us have a sample θi ∈ Ci in every cell. The set of sam-ple points {θi}N

1 is referred to as “grid” in the grid-filteringframework.

By virtue of our assumptions, the temporal prior can beexpressed as

p(θt ∈ Ci|θt−1 ∈ Cj) =∫Ci

∫Cj

φ(θt, θt−1)dθt−1dθt (3)

≈ φ(θi, θj)|Ci||Cj |,

where |Ci| is the volume of the ith cell, with the approxi-mation valid when the noise covariance in the transition ismuch wider than the volume of the cell. So the (time inde-pendent) transition probability from jth to ith cell is

Tij =φ(θi, θj)|Ci|∑N

k=1 φ(θk, θj)|Ck|. (4)

The compatibility between observation and the pose be-longing to a particular cell can be written as

φto(Ci) =

∫Ci

φto(θ)dθ ≈ φt

o(θi)|Ci|. (5)

Combining eqs 1, 3, and 5, the posterior probability ofpose being in the ith cell is

p(θt ∈ Ci|I1..t) ≈ 1Z

φto(θi)|Ci|

N∑j=1

Tijp(θt−1 ∈ Cj |I1..t−1)

(6)

=1Z

φto(θi)

N∑j=1

Sijp(θt−1 ∈ Cj |I1..t−1),

where Sij = |Ci|Tij is time independent and can be com-puted offline.

If we denote

πt =

p(θt ∈ C1|I1..t)p(θt ∈ C2|I1..t)

...p(θt ∈ CN |I1..t)

and lt =

φt

o(θ1)φt

o(θ2)...

φto(θN )

,

then the posterior can be written in vector form

πt =1W

Sπt−1. ∗ lt, (7)

where .∗ is the element-wise product, and the scaling factorW =

∑Ni=1(Sπt−1. ∗ lt)i is necessary for probabilities to

sum up to unity.

4

Page 6: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

Algorithm CRP CRPS kNN ICP CND ELMOSeconds 0.05 0.07 0.5 0.1 120 8

Table 1: Average times required for algorithms tested toprocess a single frame.

The final equation has striking resemblance to thestandard HMM update equations. It defines our on-line CONDITIONAL RANDOM PERSON tracking algorithm(CRP). We can also use standard HMM inference methods[20] to define a batch version of CRP: CRP SMOOTHED(CRPS) uses a forward-backward algorithm to find the posedistribution at every time step conditioned on all observedimages. In addition, the most likely pose sequence can befound by using Viterbi decoding and we call the resultingmethod CRP VITERBI (CRPV).

5 Implementation and EvaluationWe have implemented CRP and CRPS as described in theprevious sections. We have used the database of 300,000pose exemplars generated from a large set of motion capturedata in order to cover a range of valid poses. The imagesare synthetic, and were rendered, along with the foregroundsegmentations masks, in Poser [7] for a fixed viewpoint.The motion-capture sequences are available from [12] andinclude large body rotations, complex motions, and self-occlusions. The transition matrix was computed by locating1000 nearest neighbors in joint position space for each ex-emplar, and setting the probability of transitioning to eachneighbor to the be Gaussian with σ = 0.25. The volumeof each cell was approximated as that of a ball with radiusequal to the median distance to 50 nearest neighbors.

We used the multiscale edge direction histogram(EDH) [23] as the basic representation of images. Thebinary embedding H is obtained by thresholding individ-ual bins in the EDH. It was learned using a training set of200,000 image pairs with similar underlying poses (we fol-lowed an approach outlined in [29] for estimating false neg-ative rate of a paired classifier without explicitly samplingdissimilar pairs). This resulted in 2,575 binary dimensions.

The tracking algorithms are initialized by searching for50 exemplars in the database closest to the first frame in thesequence in the embedding space.

Due to the sizes of the database and the transition ma-trix, both algorithms require large amounts of memory, sowe performed our tests on a computer with 3.4GHz Pen-tium 4 processor and 2GB of RAM. The algorithms wereimplemented in C++, and were able to achieve real-timeperformance with average speeds of 20 frames per secondfor CRP and 14 frames per second for CRPS.

5.1 Experiments with synthetic data

We have quantitatively evaluated the performance of ourtracking method on a set of motion sequences. These se-quences were obtained in the similar way as the sequenceused for training our algorithm but were not included in thetraining set.

We compared our online algorithm, CRP, and its batchversion CRPS (CRP SMOOTHED), to four state-of-the-artpose estimation algorithms. The first baseline was a state-less k-Nearest Neighbors (kNN) algorithm that at everyframe searches the whole database for 50 closest posesbased on the embedding distance. The remaining base-line methods were incremental tracking algorithms: deter-ministic gradient descent method using the Iterative Clos-est Point (ICP) algorithm [8], CONDENSATION [25], andELMO [9]. The ICP algorithm directly maximizes the like-lihood function at every frame, whereas CONDENSATIONand ELMO evaluated the full posterior distribution. In ourexperiments, the posterior distribution in CONDENSATIONwas modeled using 2000 particles. In ELMO, the poste-rior distribution was modeled using a mixture of 5 Gaus-sians. The likelihood function defined in ICP, CONDEN-SATION and ELMO was based on the Euclidean distancebetween the articulated model and the 3D (reconstructed)points of the scene obtained from a real-time stereo system.In contrast, both CRP and kNN algorithms require only sin-gle view intensity images and foreground segmentation.

We have chosen to use the mean distance between es-timated and true joint positions as an error metric [3]. InFigure 2 we show the performance of 6 algorithms de-scribed above on four synthetic sequences.4 As can be seen,both CRP and CRPS consistently outperform kNN, andCONDENSATION5, and compare favorably to ICP. WhileCRP produces somewhat worse results than ELMO, it doesnot use stereo data, and is hundred and sixty times faster.The timing information for all compared algorithms is pre-sented in Table 1.6

5.2 Statistical analysis of results

In order to evaluate the statistical significance of these re-sults we used the following methodology. We use the meanjoint position error as a measure of accuracy of pose predic-tion on a given frame. Suppose that algorithms A and B areboth tested on the total of N frames, producing on the framei errors eA

i and eBi , respectively. The quantity of interest is

the error difference dA−Bi = eA

i − eBi . Figure 3 shows

4These results reflect correction of an error in earlier CRPS implemen-tation.

5Increasing the number of particles used for CONDENSATION should im-prove performance, but the computational costs would become prohibitive.

6We have used more iterations of gradient descent than the implemen-tation described in [9].

5

Page 7: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

0 100 200 300 400 500 600 7000

0.1

0.2

0.3

0.4

0.5

0.6

Frame

Met

ers

Mean errors in joint positions on sequence: applause2

CRPCRPSkNNICPCNDELMO

0 100 200 300 400 500 600 7000

0.1

0.2

0.3

0.4

0.5

0.6

Frame

Met

ers

Mean errors in joint positions on sequence: brushTeeth2

CRPCRPSkNNICPCNDELMO

(a) (b)

0 100 200 300 400 500 600 7000

0.1

0.2

0.3

0.4

0.5

0.6

Frame

Met

ers

Mean errors in joint positions on sequence: dryer2

CRPCRPSkNNICPCNDELMO

0 100 200 300 400 500 600 7000

0.1

0.2

0.3

0.4

0.5

0.6

Frame

Met

ers

Mean errors in joint positions on sequence: salute−chest−azumi

CRPCRPSkNNICPCNDELMO

(c) (d)

Figure 2: Comparing algorithm performance on four synthetic sequences: “applause” (a), “brush teeth”(b), “dryer” (c), and“salute” (d). The error is measured as an average distance between true and estimated joint positions. The graphs are bestviewed in color.

CRP

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

200

400

600

800

1000

1200

1400

Freq

uenc

y C

ount

s

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

50

100

150

200

250

300

350

400

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

20

40

60

80

100

120

140

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

50

100

150

200

250

300

vs. kNN: 73.1% / 2cm vs. ICP: 51.5% / 4cm vs. CND: 73.1% / 9cm vs. ELMO: 20.4% / -6cm

CRPS

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

200

400

600

800

1000

1200

Freq

uenc

y C

ount

s

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

50

100

150

200

250

300

350

400

450

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

20

40

60

80

100

120

140

−0.5 −0.3 −0.1 0 0.1 0.3 0.50

50

100

150

200

250

300

vs. kNN: 73.5% / 3cm vs. ICP: 52.6% / 5cm vs. CND: 73.7% / 9cm vs. ELMO: 22.4% / -5cm

Figure 3: Distributions of improvements in joint position estimates of CRP (first row) and CRPS (second row) vs. kNN (firstcolumn), ICP (second), CONDENSATION (third), and ELMO (fourth). Negative values along the x-axis mean lower error forthe proposed algorithm. Given for each comparison are the proportion of frames in which CRP/CRPS were better than thealternative, and the average improvement in error. See text for results on statistical significance.

6

Page 8: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

the distribution of these differences between our algorithms(CRP and CPRS) and competing algorithms computed overa large number of synthetic sequences. For example, thetop right plot shows the distribution of dCRP−ELMO. Neg-ative value of dA−B

i means that on frame i the algorithm Awas better than the algorithm B. The lack of a parametricmodel for the distribution of dA−B makes it difficult to ap-ply thorough statistical testing to hypotheses involving themean of that distribution. Therefore, the anlysis below fo-cuses on the median, which lends itslef more easily to non-parametric tests.

One question we can ask is whether the results supportthe conclusion that A is expected to be better more than halfthe time. We answer this question using the binomial signtest [24]. Intuitively, it is equivalent to modeling the out-come of each comparison (on one frame) by a coin flip inwhich “tails” means that the sign of dA−B is negative. Thenull hypothesis we wish to reject is that the coin is fair. Weapplied this test to the data histogrammed in Figure 3, us-ing p-value7 of p = 0.001. At this significance level, CRPwas better than kNN and CONDENSATION and worse thanELMO. CRPS was better than kNN, CONDENSATION andICP and worse than ELMO; we could not establish signifi-cant differences in error of CRP vs. ICP.

A more refined statistical evaluation of the difference inperformance between two estimation algorithms is based onestablishing a confidence interval on the median improve-ment. Given the desired confidence value p we seek a valueD such that the probability of the median difference of theerrors being above D is less than p.

We apply the following procedure to perform this test.Suppose that D is the q-upper quantile of the observed dis-tribution of dA−B , i.e. qN values are above D. Under theassumption that the true median of the distribution lies be-low D, we have PD = Pr(dA−B ≥ D) < 1/2. Now, wedefine a random variable ZD that is the count of observedvalues of dA−B that exceed D. Its distribution is binomialwith parameters PD and N . Consequently,

Pr(ZD ≥ qN) = Bino(ZD;PD, N) < Bino(ZD; 1/2, N),(8)

where Binomial(x; p, n) =(nx

)px(1 − p)n−x. Using De

Moivre-Laplace approximation [19],

Bino(ZD; 1/2, N) ≈ N(ZD, N/2,√

N/2), (9)

where N(x; µ, σ) = 1σ√

2πexp(−(x−µ)2/2σ2). Combin-

ing (8) and (9), and solving for the desired signifcance p,we get

q = G−1(1− p; N/2,√

N/2)/N,

7The p-value is the probability of obtaining the observed data underthe null hypothesis; that hypothesis is rejected if the p-value falls below aspecified threshold, which determines the significance of the test.

Method kNN ICP CND ELMOCRP vs. -1.5cm 0.04cm -7.13cm 4.57cmCRPS vs. -1.75cm -0.28cm -7.47cm 4.27cm

Table 2: Confidence intervals for median error reduction,with p = 0.001: with probability 1 − p, the true medianof dA−B falls below the value for row A and column B.Negative values indicate cases where we are confident withrespect to the improvement (error reduction) achieved byCRP/CRPS over competing methods.

where G is the inverse of the normal (gaussian) cumulutivedistribution function. In other words, if we choose the valueof D corresponding to such q, the probability of the true me-dian being lower than D is at least 1−p. Results of this testfor p = 0.001 are given in Table 2: there is a robust advan-tage to both CRP methods over kNN and Condensation, butnot over ELMO.

A relatively large difference between estimated mean(Figure 3) and median (Table 2) improvements of CRP vari-ants over ICP can be explained by the fact that ICP is morelikely to completely loose track (thus producing large er-rors) than CRP.

5.3 Experiments with real data

For the real data, segmentation masks were computed us-ing color background subtraction. Sample frames from acomplicated real image motion sequence are shown in theFigure 4 (the video of the original sequence and results ofour algorithm are available as supplementary material) andFigure 5. The top right pane in the supplementary videowas obtained by smoothing CRPV output and rendering theresulting pose sequence.

6 Conclusions and DiscussionWe have presented CRP, an algorithm for tracking articu-lated human motion in real-time. The main contributionsof this work are the discriminative CRF formulation of thetracking problem; use of similarity preserving embeddingfor modeling observation potential function; and the grid-filter inference algorithm that transforms the continuousdensity estimation problem into a discrete one. The result-ing algorithm is capable of accurately tracking complicatedmotions in real-time (20fps in our experiments for both syn-thetic and real data).

As future work we are interested in using extra domainknowledge to further improve the performance of the algo-rithm in two ways. First, when the set of poses that need tobe tracked is restricted, then the size of the sample database

7

Page 9: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

Figure 4: Sample frames from a gesture sequence (first row), segmentation masks (second row) and the corresponding framesfrom a most likely sequence computed by CRPV algorithm (third row). See supplimentary material for the full video.

Figure 5: Sample frames from a gesture sequence (first row), segmentation masks (second row) and the corresponding framesfrom a most likely sequence conputed by CRPV algorithm (third row)

can be decreased by removing all of the unnecessary poses.Second, when the motion patterns are constrained, for ex-ample in a dance, tracking can be made more robust by us-ing a specialized transition matrix (resulting, in the limit, intracking on a motion graph).

Acknowledgments

We would like to thank Nati Srebro for help with statisticalanalysis of the results. Part of this work was funded by theOffice of Naval Research and by DARPA.

References

[1] Ankur Agarwal and Bill Triggs. Learning to track 3dhuman motion from silhouettes. In Proceedings of the21 st International Conference on Machine Learning,2004.

[2] M. S. Arulampalam, S. Maskell, N. Gordon, andT. Clapp. A tutorial on particle filters for on-line nonlinear/non-gaussian bayesian tracking. IEEEtrans. on Signal Processing, 50, Feb 2002.

[3] A. O. Balan, L. Sigal, and M. J. Black. A quantitativeevaluation of video-based 3d person tracking. In PETS2005, 2005.

8

Page 10: Conditional Random People: Tracking Humans with CRFs and ...publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-079.pdf · Conditional Random People: Tracking Humans with CRFs ... real-time

[4] R. S. Bucy and K. D. Senne. Digital synthesis of non-linear filters. Automatica, pages 287–298, 1971.

[5] Tat-Jen Cham and James M. Regh. A mutiple hy-pothesis approach to figure tracking. Technical report,Compaq Cambridge Research Laboratory, 1998.

[6] Zhe Chen. Bayesian filtering: From Kalman filters toparticle filters, and beyond. Technical report, AdaptiveSystems Lab, McMaster University, 2003.

[7] Curious Labs, Inc., Santa Cruz, CA. Poser 5 - Refer-ence Manual, 2002.

[8] D. Demirdjian, T. Ko, and T. Darrell. Constraining hu-man body tracking. In IEEE International Conferenceon Computer Vision, pages 1071–1078, 2003.

[9] David Demirdjian, Leonid Taycher, GregoryShakhnarovich, Kristen Grauman, and TrevorDarrell. Avoiding the streetlight effect: Trackingby exploring likelihood modes. In Proceedings ofthe International Conference on Computer Vision,volume I, pages 357–364, 2005.

[10] Jonathan Deutscher, Andrew Davison, and Ian Reid.Automatic partitioning of high diminsional searchspaces associated with articulated body motion cap-ture. In Computer Vision and Pattern Recognition,Dec 2001.

[11] David E. DiFranco, Tat-Jen Cham, and James M.Regh. Reconstruction of 3-d figure motion from 2-d correspondences. In Computer Vision and PatternRecognition, 2001.

[12] {Eyes, JAPAN}. Motion capture sequences database.www.mocapdata.com, 2005.

[13] Dariu Gavrila. The visual analysis of human move-ment: A survey. Computer Vision and Image Under-standing, 73(1):82–98, 1999.

[14] David C. Hogg. Model-based vision: A program tosee a walking person. Image and Vision Computing,1(1):5–20, 1983.

[15] O. King and David A. Forsyth. How does CONDEN-SATION behave with a finite number of samples? InECCV (1), pages 695–709, 2000.

[16] S. C. Kramer and H. W. Sorenson. Recursive bayesianestimation using piece-wise constant approximations.Automatica, 24(6):789–801, 1988.

[17] John Lafferty, Andrew McCallum, and FernandoPereira. Conditional random fields: Probabilisticmodels for segmenting and labeling sequence data. In

Proc. 18th International Conf. on Machine Learning,pages 282–289. Morgan Kaufmann, San Francisco,CA, 2001.

[18] John MacCormick and Michael Isard. Partitionedsampling, articulated objects, and interface-qualityhand tracking. In ECCV (2), pages 3–19, 2000.

[19] A. Papoulis. Probability, random variables, andstochastic processes. McGrow Hill, New York, 3rdedition, 1991.

[20] Lawrence R. Rabiner. A tutorial on hidden markovmodels and selected applications in speech recogni-tion. In Readings in speech recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Fran-cisco, CA, USA, 1990.

[21] K. Rohr. Towards models-based recognition of humanmovements in image sequences. CVGIP, 59(1):94–115, Jan 1994.

[22] Romer Rosales and Stan Sclaroff. Inferring body posewithout tracking body parts. In Proc. IEEE Conf. onComputer Vision and Pattern Recognition, June 2000.

[23] G. Shakhnarovich, P. Viola, and T. Darrell. Fast poseestimation with parameter-sensitive hashing. In Proc.9th Intl. Conf. on Computer Vision, 2003.

[24] David J. Sheskin. Handbook of parametric andnonparametric statistical procedures. Chapman &Hall/CRC, 3rd edition edition, 2004.

[25] Hedvig Sidenbladh, Michael J. Black, and David J.Fleet. Stochastic tracking of 3d human figures using2d image motion. In Proc. European Conference onComputer Vision, 2000.

[26] Christian Sminchiesescu and Bill Triggs. Estimatingarticulated human motion with covariance scaled sam-pling. Int. J. Robotics Research, 22:371–391, June2003.

[27] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas.Discriminative density propagation for 3d human mo-tion estimation. In Proc. IEEE Conf. on Computer Vi-sion and Pattern Recognition, 2005.

[28] B. Stenger, P. R. S. Mendonca, and R. Cipolla. Model-based hand tracking using an unscented kalman filter.Proc. British Machine Vision Conference, 2001.

[29] Y. Ke, D. Hoiem, and R. Sukthankar. Computer Visionfor Music Identification. In Proc. IEEE Conf. Com-puter Vision and Pattern Recognition, San Diego, CA,June 2005.

9