A Geometric Particle Filter for Template-Based Visual Tracking · EMPLATE-based object tracking is a particular form of visual tracking, in which the objective is to continuously

1

A Geometric Particle Filter for Template-BasedVisual Tracking

Junghyun Kwon, Hee Seok Lee, Frank C. Park, and Kyoung Mu Lee

Abstract—Existing approaches to template-based visual track-ing, in which the objective is to continuously estimate the spatialtransformation parameters of an object template over videoframes, have primarily been based on deterministic optimization,which as is well-known can result in convergence to local optima.To overcome this limitation of the deterministic optimizationapproach, in this paper we present a novel particle filter-ing approach to template-based visual tracking. We formulatethe problem as a particle filtering problem on matrix Liegroups, specifically the Special Linear group SL(3) and the two-dimensional affine group Aff (2). Computational performanceand robustness are enhanced through a number of features:(i) Gaussian importance functions on the groups are iterativelyconstructed via local linearization; (ii) the inverse formulationof the Jacobian calculation is used; (iii) template resizing isperformed; and (iv) parent-child particles are developed andused. Extensive experimental results using challenging videosequences demonstrate the enhanced performance and robust-ness of our particle filtering-based approach to template-basedvisual tracking. We also show that our approach outperformsseveral state-of-the-art template-based visual tracking methodsvia experiments using the publicly available benchmark dataset.

Index Terms—Visual tracking, object template, particle fil-tering, Lie group, special linear group, affine group, Gaussianimportance function.

I. INTRODUCTION

TEMPLATE-based object tracking is a particular form ofvisual tracking, in which the objective is to continuously

fit an object template to a sequence of video frames [62].Beginning with the seminal work of Lucas and Kanade [37],template-based visual tracking has primarily been formulatedas a nonlinear optimization problem, e.g., [18], [9], [3], [8],[11]. As is well known, however, nonlinear optimization-basedtracking methods have the disadvantage of potentially becom-ing trapped in local minima, particularly when inter-frameobject motions are large. Global optimization strategies suchas multi-scale pyramids and simulated annealing, despite theirconsiderably larger computational requirements, also cannotconsistently ensure better tracking performance.

In this paper we propose an alternative approach totemplate-based visual tracking, based on geometric particlefiltering. Particle filtering can be regarded as a Monte Carlomethod for solving the (unnormalized) conditional density

J. Kwon is with TeleSecurity Sciences, Inc., USA (e-mail:[email protected]). H.S. Lee and K.M. Lee are with theAutomation System Research Institute, the Department of ElectricalEngineering and Computer Science, Seoul National University, Korea(e-mail: [email protected], [email protected]). F.C. Park is with theSchool of Mechanical and Aerospace Engineering, Seoul National University,Korea (e-mail: [email protected]).

equations for state estimation [14], in which the state andobservation equations are modelled as nonlinear stochasticdifferential equations. Because of its stochastic nature, particlefiltering can, when done properly, be made more robustto the problem of local minima than existing deterministicoptimization-based methods, without sacrificing much in theway of computationally efficiency. Moreover, particle filteringyields better estimation performance for nonlinear problemslike template-based visual tracking than, e.g., extensions ofthe linear Kalman filter, because of its non-parametric densityrepresentation.

In applying particle filtering to template-based visual track-ing, the first issue that needs to be addressed is how torepresent the state. The estimated transformation parametersare typically in the form of matrices satisfying certain groupproperties—in our case these will be homographies and affinetransformations—and the subsequent state space is no longera vector space, but a curved space. To be able to directlyapply existing particle filtering algorithms, which are primarilydeveloped in the context of systems whose state space is avector space, one must therefore choose a local coordinaterepresentation for the curved state space, but there are severalwell-known problems associated with this approach: (i) severallocal coordinate patches may be required to cover the entirestate space, and (ii) the choice of local coordinates willstrongly affect tracking performance. See [2] for a detaileddiscussion on coordinate parametrizations for homographies.

A second issue involves the choice of importance function,a critical step in particle filtering. The optimal importancefunction minimizes the particle weight variance, and even-tually maintains the number of effective particles to be aslarge as possible [15], so that accurate tracking is ensuredeven with a limited number of particles. Since for generalnonlinear systems the optimal importance function cannot bedetermined in closed-form, a computationally efficient andsufficiently accurate approximation is required.

The main contributions of this paper directly address thesetwo fundamental issues:

• Rather than applying conventional vector space particlefiltering to some local coordinate formulation of thetemplate-based visual tracking problem, we instead de-velop a geometric particle filtering algorithm that directlytakes into account the curved nature of the state space.Specifically, homographies can be identified with thethree-dimensional Special Linear group SL(3), while thetwo-dimensional affine transformation matrices can beidentified with the two-dimensional affine group Aff (2).As is well-known, both SL(3) and Aff (2) possess the

2

structure of an algebraic group and a differentiable mani-fold, or Lie group. Our resulting algorithms are geometric(in the sense of being local coordinate-invariant), so thattheir performance does not depend on the choice of localcoordinates.

• We derive an exponential coordinate-based approximationof Gaussian distributions for Lie groups, and use theseas optimal importance functions for particle sampling.This approach can in some sense be regarded as ageneralization of the Gaussian importance function of[15] to our Lie groups in question.

• We develop several techniques that, collectively, en-hance the accuracy and computational efficiency of ourtemplate-based visual tracking in a significant way: theiterative Gaussian importance function generation, inverseformulation of Jacobian calculation, template resizing,and use of parent-child particles.

The overall performance of our tracking framework isdemonstrated in Section VII via experiments with various chal-lenging test video sequences. The quantitative performancecomparison between the previously employed importancefunctions and ours as well as that between different parametersettings is also performed in Section VII. We finally comparethe performance of our tracking framework with several state-of-the-art template-based visual tracking methods using theMetaio benchmark dataset [35] in Section VII. The comparedmethods include the ESM tracker [8] and L1 tracker [4], whichare widely regarded as the state-of-the-art template-basedvisual tracking algorithms based on deterministic optimizationand particle filtering, respectively. Key-point matching-basedmotion estimation methods using descriptors such as SIFT[36], SURF [6], and FERNS [40] are also compared with ourframework.

II. RELATED WORK

While regarded as more straightforward relative to othervisual tracking problems [24], [41], [63], template-based visualtracking continues to arise in a wide range of mainstreamapplications, e.g., visual servoing [8], augmented reality [45],[21], and visual localization and mapping [51], [29]. What iscommon between such applications is that it is necessary toestimate the 3-D camera motion from tracking results. In orderto do so, we should accurately estimate the object templatemotion up to the homography that is the most general spatialtransformation of planar object images, i.e., we should performprojective motion tracking for those applications.

However, previous template-based visual tracking algo-rithms employing particle filtering [64], [32], [46], [31], [33],[4] are all confined to affine motion tracking, i.e., template-based visual tracking with affine transformations. The ex-ception is [60], but the accuracy of the tracking results aspresented leave something to be desired. To the best of ourknowledge, ours is the first work to solve projective motiontracking via particle filtering, with sufficient accuracy andefficiency by employing Gaussian importance functions toapproximate the optimal importance function.

Following the seminal work of Isard and Blake to success-fully apply particle filtering to contour tracking [24], the efforts

to approximate the optimal importance function for visualtracking have mainly concentrated on contour tracking [47],[34], [49], [44] by means of the unscented transformation (UT)[25]. One notable work is [1], in which the optimal importancefunction is analytically obtained with linear Gaussian measure-ment equations for point set tracking. When approximatinga nonlinear function, UT is generally recognized as moreaccurate than first-order Taylor expansions [25]. However,in our approach we rely on Taylor expansions instead ofUT, because UT requires repeated trials of the measurementprocess involving image warping with interpolation, whichmay prohibit real-time tracking. Moreover, it is not clear howto choose in a systematic way the many parameters requiredin UT.

Regression-based learning also has been successfully ap-plied to template-based visual tracking [26], [65], [22], [57],[33]. Regression methods estimate a regression relation be-tween the spatial transformation parameters and image simi-larity values using training samples, which are then directlyused in tracking. It is notable that the regression is performedon Aff (2) in [57] and [33]. Since our framework does notrequire learning with training samples, the comparison withregression-based methods is not considered here.

Previous work on particle filtering on Lie groups wasfocused on the special orthogonal group SO(3) representing 3-D rotation matrices [12], [53] and the special Euclidean groupSE(3) representing 3-D rigid body transformation matrices[52], [28]. Recently, Gaussian importance functions on SE(3)obtained by UT were utilized for visual localization andmapping with a monocular camera [29]. In [33], particlefiltering on Aff (2) is employed as a basic inference tool foraffine motion tracking. It is worth to note that the unscentedKalman filtering (or equivalently UT) on general Riemannianmanifolds has been formulated in [20] , which can be usedto approximate the optimal importance function for particlefiltering on Lie groups.

Two previous papers by the authors, [31] and [30], are themost closely related with this paper. Compared with [31], weextend the framework from affine motion tracking to projectivemotion tracking, and utilize Gaussian importance functionsinstead of the state transition density-based importance func-tions. [30] is a preliminary conference version of this paper.In addition to the extension to projective motion tracking,we have made several important improvements over [30] toincrease the accuracy and efficiency that include the iterativeGaussian importance function generation, inverse Jacobianformulation, and use of parent-child particles.

III. BACKGROUND

A. Lie Group

A Lie group G is a group which is a differentiable manifoldwith smooth product and inverse group operations. The Liealgebra g associated with G is identified as a tangent vectorspace at the identity element of G. A Lie group G and its Liealgebra g are related via the exponential map, exp : g → G.The logarithm map log is defined as the inverse (near theidentity) of exp, i.e., log : G → g. For the matrix Lie groups

3

E1 E2 E3 E4

E5 E6 E7 E8

Fig. 1. The spatial transformations induced by exponentials of Ei of sl(3)in (4). The images are made by transforming a square centered at the originwith exp(εiEi) where εi are small positive (dotted lines) or negative scalars(dash-dot lines).

we address, exp and log are given by the ordinary matrixexponential and logarithm, defined as

exp(x) =

∞∑m=0

xm

m!, (1)

log(X) =

∞∑m=1

(−1)m+1 (X − I)m

m, (2)

where x ∈ g, X ∈ G, and I is the identity matrix. For suffi-ciently small x, y ∈ g, the Baker-Campbell-Hausdorff (BCH)formula states that z satisfying exp(z) = exp(x) exp(y) isgiven by

z = x+ y +1

2[x, y] +

1

12[x, [x, y]]− 1

12[y, [x, y]] + . . . , (3)

where [·, ·] is the matrix commutator (Lie bracket) given by[A,B] = AB−BA. A more detailed description of Lie groupsand Lie algebras can be found in, e.g., [19].

As is well known, the exponential map exp locally definesa diffeomorphism between a neighborhood of G containingthe identity and an open set of the Lie algebra g centered atthe origin, i.e., given X ∈ G sufficiently near the identity,the exponential map X = exp(

∑i uiEi), where Ei are the

basis elements of g, is a local diffeomorphism. These localexponential coordinates can be extended to cover the entiregroup by left or right multiplication, e.g., the neighborhood ofany Y ∈ G can be well defined as Y (u) = Y · exp(

∑i uiEi).

Note also that in a neighborhood of the identity, the minimalgeodesics (i.e., paths of shortest distance) are given preciselyby these exponential trajectories (and their left- and right-translated versions).

B. Transformation Matrices as Lie Groups

A projective transformation is represented by a linear equa-tion in the homogeneous coordinates with a nonsingular matrixH ∈ �3×3 called a homography. To remove scale ambiguityof homogeneous coordinates, homographies can be normalizedto have the unit determinant. In this case, homographies canbe identified with the 3-D special linear group SL(3) definedas SL(3) = {X ∈ GL(3) : det(X) = +1} where GL(3) isthe 3-D general linear group corresponding to the set of 3× 3invertible real matrices. sl(3), the Lie algebra associated with

y

x

y

x

Template T(p)

Initial Frame k-th Frame

Similarity Measure

Measurement y

X =I0

k

I (W(p;X ))k k

W(p;X )k

Fig. 2. The graphical representation of the state and measurement for ourprobabilistic template-based visual tracking framework.

SL(3), corresponds to the set of 3 × 3 real matrices with thezero trace. We choose the basis elements of sl(3) as

E1 =[1 0 00 −1 00 0 0

], E2 =

[0 0 00 −1 00 0 1

], E3 =

[0 −1 01 0 00 0 0

],

E4 =[0 1 01 0 00 0 0

], E5 =

[0 0 10 0 00 0 0

], E6 =

[0 0 00 0 10 0 0

],

E7 =[0 0 00 0 01 0 0

], E8 =

[0 0 00 0 00 1 0

]. (4)

The spatial transformations induced by exponentials of Ei

are shown in Fig. 1. We can define the map vsl(3) :sl(3) → �8 to represent sl(3) elements in column vec-tors with respect to the basis elements given in (4). Forx =

[x1 x4 x7x2 x5 x8x3 x6 x9

]∈ sl(3), the map is given by vsl(3)(x) =

(x1, x9,x2−x4

2 , x2+x4

2 , x7, x8, x3, x6)�.

Note that our choice of sl(3) basis elements in (4) is differentfrom the one in [7] and [8]. The difference is that we separatelyrepresent the rotation and skew by E3 and E4, respectively.This helps to set the state covariance for particle filteringproperly in practice. For example, if we can expect the objectrotational motion is not significant, we can set the covariancevalue corresponding to E3 smaller to maximize the trackingperformance with the limited computational power.

An affine transformation is also frequently used in template-based visual tracking; such a transformation is representedby a matrix of the form [A b

0 1 ] where A belongs to GL(2),the set of 2 × 2 invertible real matrices, and b ∈ �2. Wecan regard the set of affine transformation matrices as theaffine group Aff (2), which is a semi-direct product of GL(2)and �2. The Lie algebra associated with Aff (2), denotedaff (2), is represented as [ U v

0 0 ] where U belongs to gl(2)and v ∈ �2. gl(2) is the Lie algebra associated with GL(2)and corresponds to the set of 2 × 2 real matrices. The sixbasis elements of aff (2) can be chosen as the same as thefirst six basis elements of sl(2) in (4) with the exceptionof E2. E2 of Aff (2) is given by E2 =

[1 0 00 1 00 0 0

]represent-

ing isotropic scaling. The map vaff(2) : x → �6 is givenby vaff(2)(x) = (x1+x4

2 , x1−x4

2 , x2−x3

2 , x2+x3

2 , x5, x6)� where

x =[x1 x3 x5x2 x4 x60 0 0

]∈ aff(2).

IV. PROBABILISTIC TRACKING FRAMEWORK

In this section, we describe how projective motion trackingcan be realized by particle filtering on SL(3) with appropriatedefinitions of state and measurement equations. As shown inFig. 2, the initial state X0 becomes a 3 × 3 identity matrix.

4

Representing the image pixel coordinates by p = (px, py)�,

the objective is to find Xk ∈ SL(3) that transforms the objecttemplate T (p) to the most similar image region of the k-thvideo frame Ik.

A. State Equation on SL(3)

In order to formulate a state equation on SL(3), we firstconsider a left-invariant stochastic differential equation onSL(3) of the form

dX = X ·A(X) dt+X

8∑i=1

bi Ei dwi, (5)

where X ∈ SL(3) is the state, the map A : SL(3) → sl(3)is possibly nonlinear, the bi are scalar constants, the dwi ∈� denote independent Wiener process noise, and the Ei arethe basis elements of sl(3) given in (4). The continuous stateequation (5) can be represented in a more tractable form bythe simple first-order Euler discretization:

Xk = Xk−1 · exp(A(X, t)Δt+ v−1

sl(3)(dWk)√Δt

), (6)

where dWk ∈ �8 is the Wiener process noise with a covari-ance P ∈ �8×8 rather than the identity matrix, in which thebi terms in (5) are reflected.

A(X, t) in (6) can be regarded as the state dynamics.Here we use a simple state dynamics based on the firstorder auto-regressive (AR) process on SL(3) approximatingthe generalized AR process on Riemannian manifolds of [61]using the logarithm map. The state equation on SL(3) that weactually use for projective motion tracking is then representedas

Xk = Xk−1 · exp(Ak−1 + v−1

sl(3)(dWk)√Δt

), (7)

Ak−1 = a · log(X−1

k−2Xk−1

), (8)

where a is the AR parameter, usually set to be smaller than1. We set a = 0.5 for all experiments in Section VII. We canassume in general tracking applications that X−1

k−2Xk−1 lieswithin the injectivity radius of exp, because video frame ratestypically range at least between 15 and 60 frames per second(fps), and object speeds are not excessively fast; thus Ak−1

can be considered to be uniquely determined.

B. Measurement Equation

To represent the measurement equation in a form suchthat analytic Taylor expansion is possible, we utilize thewarping function W (p;Xk) employed in [3]. For a projectivetransformation of the pixel coordinates p = (px, py)

� of the

template T (p) with Xk =[a1 a4 a7a2 a5 a8a3 a6 a9

], W (p;Xk) is defined as

W (p;Xk) =

[p′xp′y

]=

[a1px+a4py+a7

a3px+a6py+a9a2px+a5py+a8

a3px+a6py+a9

]. (9)

The region corresponding to Xk can be expressed byIk(W (p;Xk)), which is warped back to the template coor-dinates involving image interpolation. In our implementation,we simply use the nearest neighbor interpolation for compu-tational efficiency. The measurement is then a measure of the

image similarity between T (p) and Ik(W (p;Xk)), and canbe represented by g(T (p), Ik(W (p;Xk))).

We regard the object template T (p) as a set composed ofthe initial template Tinit(p) and the incremental PCA model of[46] with the mean T (p) and M principal eigenvectors bi(p).Then our measurement equation is given by

yk = g(T (p), Ik(W (p;Xk))) + nk =

[gNCC

gPCA

]+ nk, (10)

where nk ∼ N(0, R), R = diag(σ2NCC, σ

2PCA). Here, gNCC is a

normalized cross correlation (NCC) score between Tinit(p) andIk(W (p;Xk)), while gPCA is a reconstruction error determinedby

∑p

[Ik(W (p;Xk))− T (p)

]2−∑Mi=1 c

2i , where ci are the

projection coefficients of the mean-normalized image to bi(p).The roles of NCC and PCA are mutually complementary:

incremental PCA deals with appearance changes such asillumination change and pose change of non-planar objects[46], which cannot be properly dealt with by NCC, whileNCC minimizes any possible drift problems associated withincremental PCA. The possible drawback of conjunctive useof NCC and PCA is the tracking failure due to too largeerror in NCC for severe local appearance changes such asspecular reflection. As a simple remedy, we detect outlierswhose PCA reconstruction error is above the threshold (0.15in our implementation) and exclude them in calculating gNCC.

C. State Estimation via Particle Filtering

Tracking is now equivalent to estimating the filtering densityρ(Xk|y1:k), where y1:k ≡ {y1, . . . , yk}, via particle filteringon SL(3). Assume that {X(i)

k−1, w(i)k−1, i = 1, . . . , N}, a set of

SL(3) samples called particles and their weights, approximatesρ(Xk−1|y1:k−1) well. The new particles X(i)

k are then sampledfrom the importance function π(Xk|X(i)

0:k−1, y1:k), and the

weights w(i)k for X(i)

k are calculated as

w(i)k = w

(i)k−1

ρ(yk|X(i)k )ρ(X

(i)k |X(i)

k−1)

π(X(i)k |X(i)

0:k−1, y1:k). (11)

Then, {X(i)k , w

(i)k , i = 1, . . . , N} is considered to be dis-

tributed to approximate ρ(Xk|y1:k).ρ(X

(i)k |X(i)

k−1) in (11) is calculated with (7) and (8) as

ρ(X(i)k |X(i)

k−1) =1√

(2π)8 det(Q)exp

(−1

2d�1 Q

−1d1

),

(12)where d1 = vsl(3)(log((X

(i)k−1)

−1X(i)k ) − A

(i)k−1) and Q =

PΔt. The measurement likelihood ρ(yk|X(i)k ) in (11) is sim-

ilarly determined from (10) as

ρ(yk|X(i)k ) =

1√(2π)2 det(R)

exp

(−1

2d�2 R

−1d2

), (13)

where d2 = yk − y(i)k and y

(i)k = g(T (p), Ik(W (p;X

(i)k ))).

The real measurement yk is given by the value of g whenIk(W (p;Xk)) = T (p). That is, gNCC and gPCA of yk arealways given by 1 and 0, respectively.

Usually, the particles X(i)k are resampled in proportion

to their weights w(i)k in order to minimize the degeneracy

5

problem [14]. Note that a particle can be copied multipletimes as a result of resampling. Therefore, the same indices attime k− 2 and k− 1 can represent different particle historiesactually. Since it is not possible to compute A

(i)k−1 in (8)

immediately from X(i)k−2 and X

(i)k−1 due to this reason, the

state is augmented as {Xk, Ak} in practice.

Algorithm 1 An algorithm to obtain the mean of SL(3)samples {X1, . . . , XN}μ = X1

repeatΔXi = μ−1Xi

Δμ = exp(

1N

∑Ni=1 log(ΔXi)

)

μ = μΔμuntil || log(Δμ)|| < ε

Assuming that particle resampling is conducted at everytime step, the state estimation Xk can be given by thesample mean of the resampled particles. An efficient gradi-ent descent algorithm to obtain the mean of SL(3) samplesusing the Riemannian exponential map given by the form ofexp(−x�) exp(x + x�) and its inverse log map is presentedin [7]. For fast computation, we instead use the ordinarymatrix exponential and log as an approximation of Riemannianexponential and log maps using the BCH formula (3). Thecorresponding algorithm is given in Algorithm 1. Although theconvergence of Algorithm 1 cannot be guaranteed in the mostgeneral case, in our extensive experiments the sample mean ofthe resampled particles typically converges within just threeor four iterations, as long as we choose the particle whoseweight before resampling is the greatest among all particlesas the initial mean.

V. GAUSSIAN IMPORTANCE FUNCTION FOR PROJECTIVE

MOTION TRACKING

The performance of particle filtering depends heavily onthe choice of importance function. The state transition densityρ(Xk|X(i)

k−1) is popularly used as the importance functionbecause of its simplicity in implementation, as the particleweights are calculated simply in proportion to the measure-ment likelihood ρ(yk|X(i)

k ). However, since the object canmove unpredictably (i.e., differently from our assumption ofsmooth motion implied by the AR process-based state equationin (7) and (8)) and a peaked measurement likelihood (i.e., asmall covariance R for nk in (10)) is inevitable for accuratetracking, most particles sampled from ρ(Xk|X(i)

k−1) will havenegligible weights and be wasted; this will result in trackinginaccuracy.

In the ideal case, the variance of particle weights willbe minimum, i.e., evenly distributed particle weights. Theoptimal importance function minimizing the particle weightvariance is given by the form of ρ(Xk|X(i)

k−1, yk) [15], whichclearly states that the current measurement yk should beconsidered in particle sampling. Unfortunately, this optimalimportance function cannot be obtained in closed form forgeneral nonlinear state space models (except for those withlinear Gaussian measurement equations). Our measurement

equation (10) is clearly nonlinear because the warping functionW (p;Xk) itself is nonlinear and arbitrary video frames Ikare involved. Therefore, the optimal importance function mustbe properly approximated for accurate template-based visualtracking with a limited number of particles.

In this section, we present in detail how we can approximatethe optimal importance function as Gaussian distributions onSL(3) by following the Gaussian importance function approachof [15].

A. Gaussian Importance Function Approach

We consider the following vector state space model:

xk = f(xk−1) + wk, (14)

yk = g(xk) + nk, (15)

where xk ∈ �Nx , yk ∈ �Ny , and wk and nk are zero-mean Gaussian noise with covariances P ∈ �Nx×Nx andR ∈ �Ny×Ny , respectively. For this model, we briefly reviewthe principal idea of the Gaussian importance function in [15].

In [15], ρ(xk, yk|x(i)k−1) is first assumed to be jointly Gaus-

sian with the following approximated mean μ and covarianceΣ:

μ =

[μ1

μ2

]=

[E[xk|x(i)

k−1]

E[yk|x(i)k−1]

]=

[f(x

(i)k−1)

g(f(x(i)k−1))

],(16)

Σ =

[Σ11 Σ12

Σ�12 Σ22

]

=

[Var(xk|x(i)

k−1) Cov(xk, yk|x(i)k−1)

Cov(xk, yk|x(i)k−1)

� Var(yk|x(i)k−1)

]

=

[P PJ�

JP JPJ� +R

]. (17)

were yk is the approximation of yk by the first-order Taylorexpansion at f(x(i)

k−1) with the Jacobian J ∈ �Ny×Nx .

Then by conditioning ρ(xk, yk|x(i)k−1) on the recent mea-

surement yk via the conditional distribution formula formultivariate Gaussian, ρ(xk|x(i)

k−1, yk) is approximated asN(mk,Σk) where

mk = μ1 +Σ12(Σ22)−1(yk − μ2), (18)

Σk = Σ11 − Σ12(Σ22)−1(Σ12)

�. (19)

If g in (15) is linear, then ρ(xk|x(i)k−1, yk) is exactly determined

without approximation as the case of the Kalman filter. Inthe subsequent subsections, we show how we can apply thisGaussian importance function approach to our SL(3) case.

B. Gaussian Distributions on SL(3)

We first consider proper notions of Gaussian distributionson SL(3). Definitions of Gaussian distributions on generalRiemannian manifolds are given in e.g., [17] and [42]. Thesecoordinate-free characterizations, while geometrically appeal-ing, are much too computationally involved to be used fortemplate-based visual tracking, where fast computation isessential. Instead, we make use of the exponential coordinatesdiscussed in Section III-A. Note that Gaussian distributions on

6

the Lie algebra can be well-defined because the Lie algebra isa vector space. We therefore construct Gaussian distributionson SL(3) as the exponential of Gaussian distributions on sl(3)provided that the covariance values for Gaussian distribu-tions on sl(3) are sufficiently small enough to guarantee thelocal diffeomorphism property of the exponential map. Theexponential coordinates are preferable to other coordinates,since (locally in a neighborhood of the identity) the minimalgeodesics are given by exponentials of the form exp(xt),where x ∈ sl(3) and t ∈ �; noting that minimal geodesicsare the equivalent of straight lines on curved spaces, the ex-ponential map has the desirable property of mapping “straightlines” to “straight lines”.

With this background, we approximately construct a Gaus-sian distribution on SL(3) as

NSL(3)(X;μ,Σ) =1√

(2π)8 det(Σ)exp

(−1

2d�Σ−1d

),

(20)where d = vsl(3)(log(μ

−1X)). Here μ ∈ SL(3) is the meanand Σ ∈ �8×8 is the covariance for zero-mean Gaussiandistributions on sl(3). Random sampling on NSL(3)(μ,Σ) canthen be performed according to

μ · exp(v−1

sl(3)(ε)), (21)

where ε ∈ �8 is sampled from the zero-mean Gaussiandistribution on sl(3) with covariance Σ.

This definition can be regarded as an approximation of theone in [7], where we use the exponential map instead ofthe Riemannian exponential map as done in the case of theSL(3) sample mean in Section IV-C. On the rotation group,where a bi-invariant metric exists, the definition of Gaussiandistribution as given in [7] is identical to ours. We remarkthat the underlying notion behind our construction of Gaussiandistributions on Lie groups is quite similar to the nonlinearmean shift concept on Lie groups as given in [54], in whichmean shift on a Lie group is realized via the exponentialmapping of ordinary vector space mean shift on the Liealgebra.

C. SL(3) Case

To make our problem more tractable, we apply the BCHformula to (7) using only the first two terms of (3). Theresulting approximated state equation is given as

Xk = f(Xk−1) · exp(v−1

sl(3)(dWk)√Δt

), (22)

where f(Xk−1) = Xk−1 · exp(Ak−1). Comparing (22) withthe definition of random sampling from Gaussian on SL(3)in (21), we can easily identify that ρ(Xk|X(i)

k−1) corresponds

to NSL(3)(X∗k , Q), where X∗

k = f(X(i)k−1) and Q = PΔt.

Then we seek to obtain Taylor expansion of the measurementequation (10) at X∗

k1

As discussed in Section III-A, the neighborhood of X∗k

is parametrized by the exponential coordinates as X∗k(u) =

1For notational convenience we shall also use g(Xk) instead ofg(T (p), Ik(W (p;Xk))).

X∗k · exp(v−1

sl(3)(u)). Thus the first-order Taylor expansion of(10) at X∗

k can be realized by differentiating g(X∗k(u)) with

respect to u. The exponential coordinate-based approach toobtain the Taylor expansion can also be found in the literatureon optimization methods on Lie groups [56], [38], [58].

With the Jacobian J of g(X∗k(u)) evaluated at u = 0, yk at

X∗k can be approximated as

yk = g(X∗k) + Ju+ nk. (23)

Note that yk in (23) is essentially an equation of u representingthe local neighborhood of X∗

k . With a slight abuse of notation,we let the same u also denote the random variable of Gaussiandistributions on sl(3) for ρ(Xk|X(i)

k−1) = NSL(3)(X∗k , Q), i.e.,

ρ(u) = N(0, Q). Then analysis of ρ(Xk, yk|X(i)k−1) becomes

equivalent to that of ρ(u, yk) with X∗k .

We now regard ρ(u, yk) to be jointly Gaussian with thefollowing mean μ and covariance Σ:

μ =

[μ1

μ2

]=

[E[u]E[yk]

], (24)

Σ =

[Σ11 Σ12

Σ�12 Σ22

]

=

[Var(u) Cov(u, yk)

Cov(u, yk)� Var(yk)

]. (25)

Then the conditional distribution ρ(u|yk = yk) is determinedas N(u,Σk), with

u = μ1 +Σ12(Σ22)−1(yk − μ2), (26)

Σk = Σ11 − Σ12(Σ22)−1Σ�

12. (27)

The above implies that ρ(u) on sl(3) used for sampling Xk

from NSL(3)(X∗k , Q) is transformed from N(0, Q) to N(u,Σk)

by incorporating yk.The sampling of Xk with X∗

k and the updated N(u,Σk)can be realized as

X∗k · exp

(v−1

sl(3)(u+ ε)), (28)

where ε ∼ N(0,Σk). Again by using the BCH formulaapproximation, (28) can be rewritten as

X∗k · exp

(v−1

sl(3)(u))· exp

(v−1

sl(3)(ε)). (29)

This implies that ρ(Xk|X(i)k−1, yk = yk) is eventually given

by NSL(3)(mk,Σk) where mk = X∗k · exp(v−1

sl(3)(u)). That

is, the optimal importance function ρ(Xk|X(i)k−1, yk) is finally

approximated as NSL(3)(mk,Σk) with

u = μ1 +Σ12(Σ22)−1(yk − μ2), (30)

mk = X∗k · exp(v−1

sl(3)(u)), (31)

Σk = Σ11 − Σ12(Σ22)−1Σ�

12, (32)

Here, μ1 and Σ11 are immediately given by 0 and Q,respectively, because ρ(u) = N(0, Q). The remaining μ2,Σ12, and Σ22 are also straightforwardly determined as g(X∗

k),QJ�, and JQJ� + R, respectively. Particle sampling fromthe obtained NSL(3)(mk,Σk) is realized by (21), and thedenominator of (11) can be straightforwardly calculated by(20).

7

D. Jacobian Calculation

To obtain the Jacobian J of g(X∗k(u)) in (23), we employ

the chain rule

Ji =∂gi(X

∗k(u))

∂u

∣∣∣u=0

=∂gi(X

∗k)

∂Ik(p′)· ∇uIk(p′u), (33)

where p′ = W (p;X∗k), p

′u = W (p;X∗

k(u)), and ∇uIk(p′u) =∂Ik(p

′u)

∂u

∣∣u=0

. Here, Ji and gi represent the i-th row of J and

i-th element of g, respectively. ∂gi(X∗k)

∂Ik(p′) for gNCC and gPCA in(10) are given in [11] and [30], respectively.∇uIk(p′u) in (33) is related with the image gradient of Ik,

and can be further decomposed via the chain rule as

∇uIk(p′u) = ∇Ik(p′) ·∂p′

∂X∗k

· ∂X∗k(u)

∂u

∣∣∣u=0

. (34)

The first term ∇Ik(p′) of (34) is the image gradient of Ikcalculated at the coordinates of Ik and warped back to thecoordinates of T (p). The second term ∂p′

∂X∗k

of (34) can be ob-

tained by differentiating (9) with respect to a = (a1, . . . , a9)�,

where X∗k is given by X∗

k =[a1 a4 a7a2 a5 a8a3 a6 a9

]; the result is[

px

p30 −p1

p23px

py

p30 −p1

p23py

1p3

0 −p1

p23

0 px

p3−p2

p23px 0

py

p3−p2

p23py 0 1

p3−p2

p23

],

(35)with p1 = a1px + a4py + a7, p2 = a2px + a5py + a8, andp3 = a3px+a6py +a9. The last term ∂X∗

k(u)∂u

∣∣u=0

of (34) is a9× 8 matrix each of whose columns is given by X∗

k ·Ei withEi in (4) represented as a vector in �9 by a.

VI. ENHANCEMENT OF ACCURACY AND EFFICIENCY

A. Iterative Gaussian Importance Function Generation

In [30], it was shown that Gaussian importance functionsby local linearization yields better results for affine motiontracking than the state transition density. However, suchGaussian importance functions would poorly approximate theoptimal importance function for projective motion tracking,whose warping function is nonlinear. Although the second-order approximation using the Hessian is possible [48], theexact Hessian computation for our case is time-consumingand its approximation as in [11] would not much enhancethe accuracy.

To increase the accuracy of Gaussian importance func-tions by local linearization, we can apply iteratively a locallinearization-based approximation such as the iterated ex-tended Kalman filter (EKF) [5]. However, when X∗

k venturesout of the basin of convergence, it can result in importancefunctions even worse than the state transition density becauseof possible divergence.

To overcome this divergence problem, we select the bestamong all Gaussian importance functions obtained at everyiteration by the following two measures:

s1 = yk − g(mk,j), (36)

s2 = v(log(X∗k−1mk,j)), (37)

where mk,j denotes the mean of the Gaussian importancefunction obtained at the j-th iteration. s1 represents how

dissimilar the predicted measurement of mk,j is to yk, while s2represents how far mk,j is from X∗

k . To balance the influencesof s1 and s2, we define a criterion C in the form of a Gaussianusing R in (10) and Q = PΔt in (22) as

C(j) = exp(− 1

2s�1 R

−1s1)exp

(− 1

2s�2 Q

−1s2). (38)

Among all Gaussian importance functions obtained at eachiteration, we select the best maximizing C. The selectedGaussian importance function can be intuitively understoodas the one most similar to yk while minimizing divergencefrom X∗

k . The algorithm is given in Algorithm 2.

Algorithm 2 An algorithm for selective iterative Gaussianimportance function generation1. Set mk,0 = X∗

k and Σk,0 = Q = PΔt.2. Calculate C(0) with mk,0 via (38).3. For j = 1, . . . , Niter,

a. Calculate the Jacobian Jj of yk at mk,j−1.b. Set μ1 = mk,j−1, μ2 = g(mk,j−1), Σ11 = Σk,j−1,

Σ12 = Σk,j−1J�j , and Σ22 = JjΣk,j−1J

�j +R.

c. Obtain mk,j and Σk,j via (30), (31), and (32).d. Calculate C(j) with mk,j via (38).

4. Determine the importance function as NSL(3)(mk,j ,Σk,j)with j = argmaxj∈{0,1,...,Niter} C(j).

In this case, w(i)k are calculated via (11) with

ρ(X(i)k |X(i)

k−1) =1√

(2π)8 det(Q)exp

(−1

2d�1 Q

−1d1

),

(39)where Q = Σk,j−1 and d1 = v(log(m−1

k,j−1X

(i)k )), in light of

the fact that ρ(X∗k) has been deterministically transformed to

NSL(3)(mk,j−1,Σk,j−1) before calculation of Jj .Note that in Algorithm 2, the uncertainty of μ1 decreases

as the iteration proceeds, because Σ11 at the j-th iteration isgiven by Σk,j−1. In the general iterated EKF, Σ11 is fixedto the initial value (Q in our case) because the decrease inthe determinant of Σ11 during iteration implies that there aremultiple yk’s at one time step, and it does not apply to generalcases. However, since yk in our framework is always givenby the value when Ik(W (p;Xk)) = T (p), we can freelyassume multiple yk’s at one time step and gradually decreasethe uncertainty of μ1. This can contribute to accurate trackingwith a limited number of particles.

Though the proposed iterative Gaussian importance functiongeneration method can approximate the optimal importancefunction more accurately than by local linearization, the re-quired iteration likely lowers overall computational efficiency.In the subsequent subsections, we describe a collective set ofmethods that together increase computational efficiency.

B. Inverse Formulation of Jacobian Calculation

In analogy with the inverse approach of [3], we utilize theinverse formulation of the Jacobian calculation. The idea isthat the derivative of g(T (p), Ik(W (p;X∗

k(u)))) with respectto u is equivalent to that of g(T (W (p;X0(−u))), Ik(p′))where X0 is the initial identity matrix.

8

Fig. 3. (a) The template resizing employed in our framework. A andp represent the affine transformation matrix for coordinate transformationand the transformed template coordinates, respectively. (b) Solid and dottedquadrilaterals represent the initial templates and the transformed templates byX1, respectively. The transformed shape of the blue template is different fromthat of the red one because of different template size despite the same state.

Denoting W (p;X0(−u)) by p′′u, the Jacobian J ofg(T (p′′u), Ik(p′)) at u = 0 is given by

Ji =∂gi(T (p), Ik(p′))

∂T (p)· ∇uT (p′′u), (40)

where ∇uT (p′′u) =∂T (p′′

u)∂u

∣∣u=0

. Note that ∇uT (p′′u) is con-stant and can be pre-computed because it only depends onT (p) and X0, which are always constant. To obtain Ji ofgi(T (p′′u), Ik(p′)), we only need to evaluate ∂gi(T (p),Ik(p

′))∂T (p)

with T (p) and Ik(p′).

C. Template Resizing

The computational complexity of our framework increaseswith the size of T (p) because the measurement processinvolves image interpolation. To avoid the increased computa-tional complexity with large templates, we resize the templateto the specified size as described in Fig. 3(a). Since theaffine transformation matrix A for coordinate transformationonly affects the actual image interpolation, all calculationsincluding the Jacobian can be done on the transformed tem-plate coordinates p without additional care related to templateresizing.

The template resizing in our framework has further sig-nificance beyond just reducing computational complexity.Consider two initial templates with different sizes shown inFig. 3(b) in solid lines. The dotted quadrilaterals in Fig. 3(b)represent the transformed templates with X1 =

[1 0 00 1 0

0.005 0 1

].

The shapes of transformed templates are different from eachother despite the same state. This means we should set thecovariance P in (7) differently according to the template sizeeven with the same expected object movement. By resizing thetemplate, we can set P reasonably according to the expectedobject movement regardless of the original template size. Toreduce the computational complexity, one can alternatively usea subset of template pixels as in [13]. However, it still suffersfrom the problem described in Fig. 3(b) and thus is not suitablefor our framework.

D. Use of Parent-Child Particles

To reduce computational complexity further, we propose touse parent-child particles as described in Fig. 4. In Fig. 4,blue circles denote N parent particles X

(i)k to which the

Fig. 4. The use of parent-child particles using residual systematic resampling.Blue and green circles represent parent and child particles, respectively. Theiterative Gaussian importance function generation is only applied to parentparticles.

iterative Gaussian importance function generation is applied.Green circles are Nc child particles X(i,j)

k sampled from eachGaussian importance function N

(i)SL(3) for each parent particle

X(i)k . Thus, N · Nc particles are sampled while the iterative

Gaussian importance function generation is applied only Ntimes. After weighting all N ·Nc particles, we resample onlyN parent particles using residual systematic resampling [10],which is most suitable when the number of particles variesduring resampling.

Theoretically, it is possible to reduce the number of particlesduring resampling as long as the true filtering density is wellrepresented by a reduced set of particles. Our underlyingassumption about the use of parent-child particles is that thetrue filtering density will have a single mode for projectivemotion tracking of a single object. If a density is single-mode,it can be represented well by smaller number of particles thanthe case of multi-mode distributions.

One of empirical measures for how well a probabilitydensity is represented by particles and their weights is thenumber of effective particles defined as Neff =

∑i[(w

(i)k )2]−1

[15], where w(i)k are the normalized weights. The number of

effective particles varies between one (in the worst case) andthe total number of particles (in the ideal case). If the numberof effective particles does not change much by the use of theproposed parent-child particles, it can be said that the parent-child particles does not alter the statistical property of particlefiltering. In Section VII-B, we experimentally verify that theuse of parent-child particles is effective in reducing compu-tational complexity without altering the statistical property ofparticle filtering by showing the effective numbers of particlesfor different pairs of N and Nc of the same total number ofparticles are consistent.

Now we present the overall algorithm of projective motiontracking via particle filtering on SL(3) with Gaussian impor-tance functions in Algorithm 3.

9

Algorithm 3 An algorithm for projective motion tracking viaparticle filtering on SL(3) with Gaussian importance functions1. Initialization

a. Set k = 0.b. Set number of parent and child particles as N and Nc.c. For i = 1, . . . , N ,

– Set X(i)0 = I , A(i)

0 = 0.– For j = 1, . . . , Nc, set w(i,j)

0 = 1N·Nc

.d. Compute ∇uT (p′′u).

2. Importance sampling stepa. Set k = k + 1.b. For i = 1, . . . , N ,

– Compute N(i)

SL(3)(mk,Σk) via Algorithm 2 with ∇uT (p′′u).– For j = 1, . . . , Nc,

: Draw X(∗i,j)k ∼ N

(i)

SL(3)(mk,Σk) via (21) and compute

A(∗i,j)k with (8).

: Calculate the weights w(i,j)k via (11) and (39).

c. Normalize the weights as w(i,j)k =

w(i,j)k

∑i

∑j w

(i,j)k

.

3. Selection step (resampling)a. Resample using residual systematic resampling from

{X(∗i,j)k , A

(∗i,j)k , i = 1, . . . , N, j = 1, . . . , Nc} according to

{w(1,1)k , . . . , w

(N,Nc)k } to obtain i.i.d.

{X(i)k , A

(i)k , i = 1, . . . , N}.

b. Calculate Xk from {X(1)k , . . . , X

(N)k } via Algorithm 1.

c. For i = 1, . . . , N ,– For j = 1, . . . , Nc, set w(i,j)

k = 1N·Nc

.4. Go to the importance sampling step

E. Adaptation to Affine Motion Tracking

We now show how to adapt our projective motion trackingframework to the group of affine transformations. The majordifference in performing affine motion tracking via particlefiltering on Aff (2) is in the warping function W (p;Xk).All other details, such as the state equation and Gaussianimportance functions on Aff (2), can be easily adapted fromthe SL(3) case.

The warping function W (p;Xk) for an affine transfor-

mation with Xk =[a1 a3 a5a2 a4 a60 0 1

]is defined as W (p;Xk) =[

a1px+a3py+a5

a2px+a4py+a6

]. Accordingly, the second term ∂p′

∂X∗k

of (34)

for Aff (2) is given by differentiating W (p;X∗k) with respect

to a = (a1, . . . , a6)�, with X∗

k defined as X∗k =

[a1 a3 a5a2 a4 a60 0 1

];

the result is[px 0 py 0 1 00 px 0 py 0 1

]. The last term of (34) for Aff (2)

becomes a 6× 6 matrix.We can also perform similarity motion tracking via par-

ticle filtering on Aff (2) by setting the state covariance asP = diag(σ2

1 , . . . , σ26), with nearly zero σ1 and σ4. Note that

this is not possible with particle filtering on SL(3) becauseisotropic scaling is not present in (4).

VII. EXPERIMENTS

In this section, we demonstrate the feasibility of our particlefiltering-based probabilistic visual tracking framework throughextensive experiments using various test video sequences. Forall test sequences, we perform projective motion tracking byparticle filtering on SL(3) with C implementation on a desktop

1029 1306

43 518 1376

93 401

17

86 331 406 413

77 340 414

397 556 715 47

45

(a)

(b)

(c)

(d)

(e)

7 49 71 78

(f)

Fig. 5. Tracking results by our framework for (a) “MousePad”, (b) “Towel”,(c) “Tea”, (d) “Bear”, (e) “Acronis”, and (f) “Book”, all of which containplanar object images. The red lines in (d) and (e) represent the results byour framework when “using only NCC” and “without outlier detection”,respectively.

162 393 520 1091

45 119 257 361

100 269 351 592

30 126 244 397

(a)

(b)

(c)

(d)

Fig. 6. Tracking results by our framework for (a) “Paper” and (b)“Magazine”, which contain non-rigid object images, and (c) “Doll” and (d)“David”, which contain rigid non-planar object images.

PC equipped with Intel Core i-7 2600K CPU and 16 GB RAM.The experimental results can be also seen in the supplementaryvideo2.

A. Overall Tracking Performance

In order to demonstrate the overall performance of theproposed tracking framework, we use the following videosequences of planar objects: “Towel” and “MousePad” from[65], “Bear” and “Book” from [50], and “Tea” and “Acronis”

2The video is also available at http://cv.snu.ac.kr/jhkwon/tracking video.avi.

10

of our own. For non-rigid (originally) planar objects, we use“Paper” from [50] and “Magazine” from [39] while we use“Doll” and “David” from [46] for rigid non-planar objects.

We crop the initial object template Tinit(p) from the firstframe manually and resize it to a size of 40 × 40. We set(N,Nc) = (40, 10) for parent-child particles and Niter = 5for the iterative Gaussian importance function generation. ThePCA bases are incrementally updated at every fifth frameafter the initial PCA basis extraction from the first 15 frames.The state covariance P and measurement covariance R aretuned specifically for each sequence to yield satisfactorytracking results. The parameters tuned for each sequence aresummarized in the separate supplementary document. Our Cimplementation with this setting runs in about 15 fps. By theinverse formulation of the Jacobian calculation, the trackingspeed is increased from 8 fps to 15 fps. There is no noticeableperformance degradation by template resizing.

Figure 5 shows the tracking results by our framework forthe test sequences of planar objects. Please also watch thesupplementary video for better understanding of the trackingperformance of our framework. Fig. 5(a) and Fig. 5(b) showthat our framework accurately tracks the planar objects in“MousePad” and “Towel” despite the motion blur caused byfast and large inter-frame object motion. Our framework isalso able to track the planar object in “Tea” consistentlydespite temporal occlusion as Fig. 5(c) shows. The robustnessto motion blur and temporal occlusion can be considered asoriginating from the stochastic nature of particle filtering.

The effect of conjunctive use of NCC and PCA in ourmeasurements can be verified from the results for “Bear” inFig. 5(d). When using only gNCC for the measurement (redlines in Fig. 5(d)), tracking failure occurs because of localillumination changes. On the contrary, our framework usingboth gNCC and gPCA (yellow lines in Fig. 5(d)) accurately tracksthe object by accounting for the local illumination changethrough an incrementally updated PCA model.

The effect of outlier detection can be identified from thetracking result for “Acronis” in Fig. 5(e). Without outlierdetection (red lines in Fig. 5(e)), our framework fails to trackthe object because of large NCC error induced by severe localillumination changes. On the contrary, by detecting outlierswhose PCA reconstruction error is above the threshold andexcluding them in calculating gNCC (yellow lines in Fig. 5(e)),our framework is able to successfully track the object.

Fig. 5(f) shows the tracking failure for “Book” even withthe conjunctive use of NCC and PCA with outlier detection. In“Book”, severe specular reflection changes the object appear-ance so rapidly that even the incremental PCA model cannotcapture such appearance changes adequately, which leads tothe tracking failure.

Although our framework is inherently best suited for track-ing rigid planar objects, it can also yield satisfactory trackingresults up to a certain extent for non-rigid and non-planarobject cases. Fig. 6(a) and Fig. 6(b) show the tracking resultsfor “Paper” and “Magazine” in which originally planar objectsare being deformed non-rigidly. For both sequences, ourframework estimates planar homographies locally optimal inminimizing the difference to the initial object template. We

(a) “Mascot” (b) “Bass” (c) “Cube”

Fig. 7. Test video sequences used in Section VII-B and Section VII-C. Thered crosses are the ground truth object corner points manually extracted.

can confirm that such homography estimation is reasonablyacceptable via a visual check of the tracking results in thesupplementary video.

Fig. 6(c) and Fig. 6(d) show the tracking results for“Doll” and “David” in which rigid non-planar objects arechanging their poses frequently under the varying illuminationcondition. In spite of appearance changes owing to objectpose and illumination changes, our framework estimates theobject image region quite accurately for both sequences. Thesesuccessful tracking results can be mainly attributed to gPCA inthe measurement as such appearance changes are captured bythe incrementally updated PCA bases. The tracking results byour framework for “Doll” and “David” are in good agreementwith the results in [46] where the incremental PCA model forvisual tracking is originally proposed.

B. Tracking Performance Analysis with Parameter Variations

In this subsection, we analyze the tracking performance ofour framework with parameter variations. Fixing the templatesize and Niter to 40×40 and five, respectively, we vary (N,Nc)for parent-child particles, state covariance P , and measurementcovariance R. For this experiment, we use our own videosequences, “Mascot”, “Bass”, and “Cube” shown in Fig. 7,with the ground truth corner points extracted manually. Sincethere is little local illumination change in the test sequences,we use only gNCC for the measurement in this experiment tosimplify the performance analysis.

In order to analyze the tracking performance quantitatively,we consider the followings: “Time”, “Error”, “Std”, and Neff.“Time” is the average computation time per frame in millisec-onds. With the ground truth corner points cjk and estimatedcorner points cjk at time k, we calculate the RMS error as

ek =√

14

∑4j=1 |c

jk − cjk|2/sk, where sk is the ground truth

object area at time k. The purpose of error normalizationby the object area is to prevent the error from being scaledaccording to the object scale change. “Error” and “Std” arethe mean and standard deviation of ek. As aforementioned,Neff represents how well a probability density is representedby particles and their weights. The larger Neff is, the stablerthe tracking is.

We first vary (N,Nc) for parent-child particles, which arethe most important factors for tracking performance in prac-tice. We fix P and R to the values to yield satisfactory trackingresults for each sequence. P and R tuned for each sequencecan be found in the separate supplementary document. Thereare 12 combinations of (N,Nc) corresponding to 200, 400,and 600 particles in total.

11

TABLE ITRACKING PERFORMANCE COMPARISON BETWEEN DIFFERENT COMBINATIONS OF (N,Nc) FOR PARENT-CHILD PARTICLES.

Mascot

(N,Nc) (200,1) (40,5) (20,10) (10,20) (400,1) (80,5) (40,10) (20,20) (600,1) (120,5) (60,10) (30,20)Time 25.61 14.99 11.78 9.76 50.92 29.74 23.47 19.31 76.16 44.42 35.10 28.96Error 0.0260 0.0262 0.0266 0.0280 0.0251 0.0253 0.0253 0.0263 0.0249 0.0249 0.0251 0.0255Std 0.0055 0.0058 0.0059 0.0071 0.0050 0.0051 0.0053 0.0057 0.0049 0.0049 0.0051 0.0054Neff 29.41 29.59 30.10 30.41 54.02 54.09 54.54 54.79 76.55 76.59 77.29 78.89

Bass


Cube


0.024 0.025 0.026 0.027 0.028 0.0290

50

100

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.038 0.039 0.04 0.041 0.042 0.0430

50

100

150

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.017 0.018 0.019 0.02 0.021 0.0220

50

100

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.0170.0180.0190.020.0210.0220

50

100

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

N=200, Nc=1

N=40, Nc=5

N=20, Nc=10

N=10, Nc=20

N=400, Nc=1

N=80, Nc=5

N=40, Nc=10

N=20, Nc=20

N=600, Nc=1

N=120, Nc=5

N=60, Nc=10

N=30, Nc=20

(a) “Mascot” (b) “Bass” (c) “Cube”

Fig. 8. Scatter plots of “Error” and “Time” for each combination of (N,Nc) in Table. I. The legend for (a) applies to all sub figures.

Table I shows combinations of (N,Nc) and correspondingtracking performance. The numbers in Table I are obtainedby averaging over 10 independent runs for each combinationof (N,Nc). It is clear from Table I that as the total numberof particles increases, “Error” and “Std” decrease along withthe increase of “Time”. On the other hand, within the sametotal number of particles, there is a trend that as Nc increases,“Error” and “Std” increase along with the decrease of “Time”3.

We can verify the effectiveness of the use of parent-childparticles by specifically comparing the cases of (N,Nc) =(400, 1) and (N,Nc) = (60, 10). For all test sequences,“Time” for (N,Nc) = (60, 10) is much less than that for(N,Nc) = (400, 1) while both cases yield similar “Error”and “Std”. This implies that we can enhance the computa-tional efficiency considerably without sacrificing the trackingaccuracy by employing the parent-child particles. It is alsoworth to note that in Table I the values of Neff for differentvalues of Nc are quite similar within the same total numberof particles, i.e., the same N ·Nc. This means that the use ofparent-child particles does not alter the statistical property ofparticle filtering and tracking stability.

It is necessary in practice to find the appropriate combi-nation of (N,Nc) that balances the tracking accuracy andcomputation time. Fig 8 shows the scatter plots of “Error”and “Time” for each combination of (N,Nc). From Fig. 8,we can identify that our choice of (N,Nc) = (40, 10) for

3Note that the tracking speed gain is not strictly inversely proportionalto the number of parent particles, because particles copied from the sameparticle during resampling can share the same importance function withoutre-computation.

TABLE IITRACKING PERFORMANCE COMPARISON BETWEEN DIFFERENT STATE

COVARIANCES.

Mascot

P P1 P2 P3 P4 P5

Error 0.0271 0.0261 0.0254 0.0252 0.0249Std 0.0061 0.0056 0.0056 0.0052 0.0051Neff 57.26 56.94 54.55 51.24 47.35

Bass

P P1 P2 P3 P4 P5

Error 0.0392 0.0391 0.0393 0.0394 0.0396Std 0.0086 0.0082 0.0080 0.0076 0.0074Neff 107.65 103.92 97.39 91.32 85.43

Cube

P P1 P2 P3 P4 P5

Error 0.0199 0.0192 0.0191 0.0184 0.0189Std 0.0057 0.0055 0.0054 0.0051 0.0055Neff 87.98 89.11 88.16 87.35 83.91

the experiments in Section VII-A is reasonable because greencircles in Fig. 8 representing the cases of (N,Nc) = (40, 10)can be regarded as points of good compromise between“Error” and “Time”.

We now check the tracking performance changes accordingto the changes of the state and measurement covariances, Pand R. For each sequence, we vary P tuned previously asP1 = 0.82P , P2 = 0.92P , P3 = P , P4 = 1.12P , andP5 = 1.22P and similarly for R. We set (N,Nc) = (40, 10)for this experiment. Table II and Table III summarize thetracking performance comparison between different state andmeasurement covariances. The numbers in Table II and Ta-ble III are obtained by averaging over 10 independent runsfor each case.

It is hard to find out certain trends from Table II except

12

TABLE IIITRACKING PERFORMANCE COMPARISON BETWEEN DIFFERENT

MEASUREMENT COVARIANCES.

Mascot

R R1 R2 R3 R4 R5

Error 0.0231 0.0241 0.0255 0.0269 0.0286Std 0.0047 0.0050 0.0054 0.0056 0.0063Neff 42.06 49.05 54.01 59.91 64.08

Bass

R R1 R2 R3 R4 R5

Error 0.0365 0.0380 0.0393 0.0404 0.0421Std 0.0070 0.0077 0.0081 0.0083 0.0090Neff 78.59 88.87 97.79 105.61 112.14

Cube

R R1 R2 R3 R4 R5

Error 0.0169 0.0182 0.0187 0.0198 0.0211Std 0.0045 0.0049 0.0054 0.0058 0.0063Neff 79.15 84.01 87.94 90.40 95.03

one that Neff decreases as P increases. This can be intuitivelyunderstood: a larger P will result in larger variations ofsampled SL(3) particles and accordingly smaller number ofsampled particles will be similar to the ground truth. Inpractice, it is important to set P to cover a possible expectedrange of the object motion. As we show in Fig. 3, our templateresizing approach is important in setting P correctly regardlessof the initial size of the object image. Note that the valuesof P for various test video sequences given in the separatesupplementary document are in very similar ranges.

There is a clear trend in Table III that “Error”, “Std”,and Neff all increase as R increases. This can be intuitivelyunderstood: if R is set to the very large value, all particleswill have essentially the same weights regardless of theirsimilarity to the ground truth. In this case, a larger Neff doesnot necessarily result in more stability in tracking. It shouldbe understood that the comparison of different values of Neff

is only meaningful with the same P and R. In practice, weshould set R to balance the tracking accuracy and stability.The difficulty is that the proper value of R depends on theobject image characteristics as identified from the separatesupplementary document. Therefore, it would be an interestingtopic to find a way to set a proper R automatically based onobject image properties such as gradient and texture.

C. Comparison with Different Importance Functions

We now compare the performance of our Gaussian impor-tance functions with that of the state transition density (“STD”)and the Gaussian importance functions by local linearization(“LL”), which is employed in [30]. We implement “LL” alsowith the inverse formulation of the Jacobian calculation tomake its computation time comparable to ours. We use thesame test sequences as in Section VII-B. For “STD” and “LL”,we set N as 400, 800, and 1200 without the use of parent-child particles while we set (N,Nc) = (40, 10) for ours. Allthe remaining parameters are the same as in Section VII-B.

Table IV summarizes the performance comparison betweendifferent importance functions. The numbers in Table IV areobtained by averaging over 10 independent runs for each case.For “STD” and “LL”, it is clear that as N increases, “Error”and “Std” decrease along with the increase of “Time” andNeff. Comparing the cases of “STD” and “LL” with the sameN , we can confirm from the decreased “Error” and “Std” and

(a) “Mascot”

0.02 0.03 0.04 0.050

20

40

60

80

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.04 0.05 0.060

20

40

60

80

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.02 0.030

20

40

60

80

Tracking Error

Com

puta

tion

Tim

e/F

ram

e (m

sec)

0.02 0.03 0.04 0.050

20

40

60

80

Tracking Error

mpu

tatio

n T

ime/

Fra

me

STD, N=400STD, N=800STD, N=1200LL, N=400LL, N=800LL, N=1200Ours, N=40, N

c=10

(a) “Bass” (c) “Cube”

Fig. 9. Scatter plots of “Error” and “Time” for each case in Table IV. Thelegend for (a) applies to all sub figures.

(a) “Low” (b) “Repetitive”

(c) “Normal” (d) “High”

Fig. 10. Target objects in the Metaio benchmark dataset [35]. There are fourgroups of objects classified according to the object texture property. For eachobject, there are five different video sequences focusing on different typesdynamic behaviors.

increased Neff that the tracking performance is enhanced bythe local linearization-based Gaussian importance function.

However, “LL” fails to outperform our framework for alltest sequences in terms of “Error” and “Std”. Note that ourframework yields the least “Error” and “Std” for all testsequences with the greatest Neff. It is remarkable that Neff

for our framework with 400 particles in total is even greaterthan that for “LL” with 1200 particles.

Fig 8 shows the scatter plots of “Error” and “Time” foreach case. We can infer from Fig. 8 that “STD” and “LL”with the same “Time” as ours would yield greater “Error”than ours, and they would require much greater “Time” toyield the same “Error” as ours. This can be better illustratedby the supplementary video, which shows tracking results for“Mascot”, “Bass”, and “Cube” by each importance functionwith the similar computational complexity, i.e., N = 1200 for“STD”, N = 600 for “LL”, and (N,Nc) = (40, 10) for ours.The result videos clearly indicate that our framework yieldsmore accurate and stabler tracking results than “STD” and“LL” when run with the similar computational complexity.

D. Benchmark Test Using Metaio Dataset

We finally evaluate the tracking performance of our frame-work using the Metaio benchmark dataset [35], which has beendeveloped to evaluate template-based tracking algorithms. Thebenchmark dataset is composed of the following four groupsclassified according to the object texture property: “Low”,“High”, “Repetitive”, and “Normal”. Each group possessestwo different objects as shown in Fig. 10. For each object,there are five different video sequences focusing on differenttypes of dynamic behaviors, namely, “Angle” for the camera

13

TABLE IVTRACKING PERFORMANCE COMPARISON BETWEEN DIFFERENT IMPORTANCE FUNCTIONS.

Mascot

Importance Function STD (400) STD (800) STD (1200) LL (400) LL (800) LL (1200) Ours (40,10)Time 14.77 29.03 43.57 20.65 41.16 61.71 23.47Error 0.0507 0.0416 0.0347 0.0353 0.0299 0.0283 0.0253Std 0.0241 0.0202 0.0116 0.0150 0.0084 0.0070 0.0053Neff 4.17 6.51 8.71 7.61 13.45 19.10 54.54

Bass


Cube


viewing angle change, “Range” for the object image sizechange, “FastFar” for the fast moving object far from thecamera, “FastClose” for the fast moving object close tothe camera, and “Illumination” for the illumination change.Therefore, there are 40 test video sequences in total in thebenchmark dataset.

For each sequence, the initial coordinates of four cornerfiducial points outside the object image region are given. Thegoal is to estimate those fiducial point coordinates for theremaining frames. We manually extract the object templatefrom the initial frame and perform projective motion trackingfor the remaining frames. Then we estimate the fiducial pointcoordinates by transforming their initial coordinates using theestimated homographies. In [35], the tracking at a specificframe is considered to be successful if the root-mean-square(RMS) error of the estimated fiducial point coordinates is lessthan 10 pixels. There is no consideration of the object scalechange in calculating the tracking error.

For experiments using the Metaio benchmark dataset, weset (N,Nc) = (80, 10) to increase the tracking accuracy withthe reasonable computational complexity. The template size is40×40 and Niter is set as five as before. The state covariance Pand measurement covariance R are tuned specifically for eachsequence to yield satisfactory tracking results. The parametersettings for each sequence are summarized in the separatesupplementary document. Although the video sequences in theMetaio benchmark dataset are in color, we use their grayscaleversions.

We compare our tracking performance with five differenttemplate tracking methods: the ESM tracker based on nonlin-ear optimization [8], the L1 tracker based on particle filtering[4], and motion estimation by key-point matching using SIFT[36], SURF [6], and FERNS [40]. The ESM tracker and L1tracker are generally considered as the state-of-the-art trackingmethods in their own categories, i.e., nonlinear optimizationand particle filtering. The main strength of the ESM tracker isa large basin of convergence achieved by using both forwardand inverse gradients of the object template while that of theL1 tracker is the robustness to object appearance changesachieved by the advanced object appearance model basedon sparse representation. SIFT, SURF, and FERNS are also

widely used for key-point matching because of their ability toextract distinctive local features stably.

We reuse the tracking results of the ESM tracker and key-point matching-based motion estimation methods reported in[35] without re-running them. For the L1 tracker, we use thesource code provided by the authors of [4]. In order to achievehigher tracking accuracy of the L1 tracker, we use 2000particles, which is rather many, with the state covariance valuessomewhat increased from the default and tuned differently foreach sequence. The other parameters remain unchanged fromthe default.

It is rather meaningful to compare our tracking performancewith the ESM tracker because it also treats SL(3) as anoptimization variable. That is, it is a comparison betweendeterministic and probabilistic template-based visual trackingalgorithms employing the same Lie group-based representa-tion. The comparison with the ESM tracker can verify ourclaim that the well-designed particle filtering can be morerobust to the problem of local optima than the deterministicoptimization.

It is also meaningful to compare our tracking performancewith the L1 tracker because it also employs particle filteringas the state estimation tool as ours but only with the statetransition density-based importance function. Furthermore, theL1 tracker can estimate the object motion up to affine trans-formations. Therefore, the comparison with the L1 trackercan verify whether our two contributions, namely the use ofthe Gaussian importance functions and the extension of ouraffine motion tracking framework of [30] to projective motiontracking, are really important for highly accurate templatetracking.

1) Overall Performance: Bar graphs in Fig. 11 summarizethe evaluation of tracking results by each tracking method interms of the tracking success rate. Table V shows the aver-age tracking success rates for each specific sub-groups. Ourframework yields higher tracking success rates for “Repetitive”and “Normal” than for “Low”. It is rather surprising that thesuccess rate for “High” is the lowest. Fig. 11(d) tells that sucha low success rate for “High” is due to the quite low successrate for the grass image in Fig. 10(d) in which no distinctstructure exists.

On the other hand, our framework yields higher tracking

14

Angle Range FastFar FastClose Illum. Average0

50

100Low #1

Succ

ess

Rate

(%)


50

100Low #2

Succ

ess

Rate

(%)

FERNSSIFTSURFESML1Ours


50

100Repetitive #1

Succ

ess

Rate

(%)


50

100Repetitive #2

Succ

ess

Rate

(%)


50

100Normal #1

Succ

ess

Rate

(%)


50

100Normal #2

Succ

ess

Rate

(%)


50

100High #1

Succ

ess

Rate

(%)


50

100High #2

Succ

ess

Rate

(%)

(a) “Low”

(b) “Repetitive”

(c) “Normal”

(d) “High”

Fig. 11. Evaluation of tracking results for the Metaio benchmark dataset by FERNS, SIFT, SURF, ESM, L1, and our framework in terms of the trackingsuccess rate. The legend for “Low #1” in (a) applies to all sub figures.

TABLE VTRACKING PERFORMANCE COMPARISON BETWEEN FERNS, SIFT, SURF,

ESM, L1, AND OUR FRAMEWORK FOR THE METAIO BENCHMARK

DATASET. THE NUMBERS ARE AVERAGE TRACKING SUCCESS RATES FOR

EACH SPECIFIC SUB-GROUPS.

Method Low Repetitive Normal High Avg.

FERNS 29.49 48.26 58.61 43.32 44.92SIFT 49.77 71.49 67.86 68.26 64.35SURF 19.56 33.38 50.18 24.66 31.95ESM 57.78 28.78 71.62 30.94 47.28L1 21.19 22.69 20.08 20.43 21.10

Ours 67.81 84.58 81.95 63.76 74.53Method Angle Range FastFar FastClose Illum.

FERNS 44.78 52.84 12.43 44.30 70.24SIFT 59.33 87.46 24.69 56.62 93.66SURF 27.11 41.11 6.27 38.32 46.92ESM 69.98 57.05 15.11 41.72 52.53L1 8.05 38.80 21.11 5.54 31.99

Ours 88.36 81.22 66.56 43.48 93.00

success rates for “Angle”, “Range”, and “Illumination” thanfor “FastFar” and “FastClose”. The reason of the lower successrate for “FastFar” is severe motion blur caused by fast camera

motion while that for “FastClose” is partially visible objectsdue to the closeness to the camera. The higher success trackingrates for “Illumination” implies that our measurement usingNCC and PCA can properly deal with illumination changes.

In comparison with other template tracking methods, ourframework yields the highest tracking success rate averagedover all 40 test sequences as Table V shows. Specifically,our framework yields the highest tracking success rates for“Low”, “Repetitive”, “Normal”, “Angle”, and “FastFar” whileour framework is the second best for “High”, “Range”, and“Illumination”, following SIFT. The reason of the best trackingsuccess rates of SIFT for “Range” and “FastClose” is itestimates homographies only using visible key-points unlikeour framework. Overall, the performance ranking is as fol-lows: ours (74.53%), SIFT (64.35%), ESM (47.28%), FERNS(44.92%), SURF (31.95%), and L1 (21.10%).

2) Comparison with ESM Tracker: Our framework consis-tently outperforms the ESM tracker for all sub-groups. Theperformance difference for “Repetitive” and “FastFar” wellcharacterizes the different inherent properties of deterministicoptimization and stochastic particle filtering. The lower track-

15

ing success rate of the ESM tracker for “Repetitive” is due tobeing trapped in the local optima with similar appearanceswhile our framework suffers less from such local optimaproblems because of the stochastic nature of particle filtering.Although both methods do not employ a specific model ofmotion blur, our framework yields far better tracking successrates than the ESM tracker for “FastFar” where severe motionblur exists. This can be also attributed to the stochastic natureof particle filtering.

3) Comparison with L1 Tracker: Our framework also con-sistently outperforms the L1 tracker for all sub-groups withconsiderable margins. The average success rate of our frame-work is 74.53% while that of the L1 tracker is 21.10%. Whencompared with our framework, the main reason of the poorperformance of the L1 tracker is it can estimate the objectmotion only up to affine transformations. Among “Angle”,“Range”, “FastFar”, “FastClose”, and “Illumination”, the L1tracker performs worst for “Angle”, where the perspectiveeffect is severe due to large camera viewing angle changes.Conversely, the L1 tracker performs best for “Range”, wherethe camera viewing angle change is more limited than othercases.

Another reason for the poor performance of the L1 trackeris it employs the state transition density-based importancefunction unlike our framework. Only considering the frameswhere the perspective effect is not severe, the qualitativetracking performance of the L1 tracker is far better than thenumbers in Table V suggest, i.e., it tracks the object withseemingly good accuracy, but the RMS tracking errors areoften greater than the threshold (10 pixels). We can enhancethe tracking accuracy of the L1 tracker with the state transitiondensity-based importance function by using more particles,e.g., 5000, but it is not practical considering the computationalcomplexity. The average tracking error of the L1 tracker forthe successfully tracked frames is 7.02 while that of ourframework is 4.88. These suggest that the state transitiondensity-based importance function employed by the L1 trackeris not suitable for applications where high tracking accuracy ishighly required such as augmented reality and visual servoing.

VIII. DISCUSSION

From experiments using various test video sequences, wehave demonstrated that our probabilistic template-based visualtracking framework can provide quite accurate tracking re-sults with sufficient computational efficiency. Compared with[30], such satisfactory performance is largely derived froma collection of techniques, including the iterative Gaussianimportance function generation, inverse formulation of theJacobian calculation, template resizing, and use of parent-childparticles.

The benchmark test using the Metaio dataset clearly demon-strates the practical value of our framework for applicationssuch as visual servoing, augmented reality, and visual localiza-tion and mapping, where accurate projective motion trackingis necessary for 3-D camera motion estimation. Note thatthe Metaio benchmark dataset has been developed specificallyin consideration of augmented reality applications. We have

shown that our framework yields far better tracking perfor-mance for the Metaio benchmark dataset than the trackingmethods currently popularly used for augmented reality ap-plications, namely, the deterministic optimization-based ESMtracker and motion estimation by key-point matching usingSIFT, SURF, and FERNS.

The extension of our previous affine motion tracking al-gorithm of [30] to projective motion tracking is anotherimportant factor for the satisfactory tracking performance forthe Metaio benchmark dataset. When the perspective effectis not perceivable (e.g., when the object is far enough fromthe camera or its size is small), an affine motion tracker canestimate the object template motion rather accurately. How-ever, in most augmented reality applications usually targetedfor indoor environments, the perspective effect nearly alwaysexists as shown by our test video images in Fig. 5 and theMetaio benchmark dataset images. This also holds true forthe cases of visual servoing and indoor visual localizationand mapping. Affine motion tracking should not be used forsuch applications where the 3-D camera motion is estimatedfrom tracking results because of its limited ability to estimatethe object template motion accurately. Therefore, it should beregarded that what we have done to our previous work [30]in this paper, i.e., the extension of the affine motion track-ing framework to projective motion tracking with increasedcomputational efficiency and tracking accuracy, has not onlytheoretical merits but also considerable practical importance.

Especially for visual tracking, it is difficult to model the ob-ject dynamics precisely for freely moving objects. Due to thisreason, it has been the focus to improve the object appearancemodel in improving the performance of particle filtering-basedvisual tracking, e.g., [46], [59], [23], [4]. However, we arguethat the improvement of particle sampling other than the objectappearance model is also important especially for projectivemotion tracking. The degree of freedom of projective motion iseight rather higher than those of other forms of motion suchas similarity and affine. When we sample particles naivelyusing the state transition density-based importance function,the number of particles necessary to get satisfactory resultsincreases with the state dimension exponentially. Therefore,the improvement of particle sampling is essential for thepracticality of projective motion tracking via particle filtering.The poor performance of the L1 tracker of [4] for the Metaiobenchmark dataset also suggests that even with the advancedobject appearance model, the naive particle sampling usingthe state transition density-based importance function can leadto tracking inaccuracy. In this paper, we have considerablyimproved particle sampling by the proposed iterative Gaussianimportance function generation.

The way to represent the state also affects the trackingperformance considerably. In this paper, we have not pre-sented a performance comparison between our Lie group-based state representation and a conventional vector space one.The superiority of our Lie group-based state representation toa conventional vector space one has been well demonstratedboth conceptually and experimentally in [31] and [30], and thereader is referred to these sources for further details.

The question of how to design an image similarity measure

16

robust to various challenging causes such as motion blur, tem-poral occlusion, and illumination change is obviously relevantand important to template-based visual tracking performance.However, the design of image similarity measures is not the fo-cus of this paper. Instead, we have shown how well-establishedimage similarity measures suitable to template-based visualtracking, such as NCC and PCA reconstruction error, caneasily fit into our framework and yield satisfactory trackingperformance. The only restriction of our framework on imagesimilarity measures is that they should be differentiable, andthat the derivatives can be obtained at least approximately. Thepractical performance shall be dependent on the linearity ofthe employed similarity measure. In this sense, the entropy-based similarity measures such as mutual information [43] anddisjoint information [55] can be also used for measurements inour framework because their derivatives can be approximatelyobtained [11], [16].

We have not addressed the occlusion issue explicitly inthis paper, since the stochastic nature of particle filtering candeal (at least partly) with short-term and partial occlusion, asexperimentally shown in Section VII-A and previously in [46],[31]. One can apply more specific strategies such as the robusterror norm [9] and the sub-templates [27] to our frameworkin order to effectively deal with long-term occlusion.

In [50], as an efficient alternative to global optimizationstrategies, it is suggested to run the Kalman filter as a pre-processing step to give a reasonable initial point for optimiza-tion. Similarly, to refine our tracking results we can run anoptimization procedure as a post-processing step using theestimated state as the initial point. We can also considerfusion with the detection-based approach such as [21] tobe able to escape from temporal tracking failure. Althoughthese conjunctive uses of different kinds of methods fortemplate-based visual tracking may not be so appealing from atheoretical point of view, in practice they can potentially resultin considerable gains both in accuracy and efficiency.

IX. CONCLUSION

In this paper, we have proposed a novel particle filteringframework to solve template-based visual tracking probabilis-tically. We have formulated particle filtering on SL(3) withappropriate state and measurement equations for projectivemotion tracking that does not require vector parametrizationof homographies. In order to utilize the optimal importancefunction, we have clarified the notions of Gaussian distribu-tions and Taylor expansion on SL(3) using the exponential co-ordinates, and approximated the optimal importance functionas a Gaussian distribution on SL(3) by local linearization. Wehave also proposed an iterative Gaussian importance functiongeneration method to increase the accuracy of Gaussian impor-tance functions. The computational efficiency is enhanced bythe inverse formulation of the Jacobian calculation, templateresizing, and use of parent-child particles. We have also ex-tended the SL(3) particle filtering framework to affine motiontracking on the group Aff (2). The improved performance ofour probabilistic tracking framework and superiority of ourGaussian importance functions to other importance functions

have been experimentally demonstrated. Finally, it has beenshown that our framework yields superior tracking results toseveral state-of-the-art tracking methods via experiments usingthe publicly available benchmark dataset.

REFERENCES

[1] E. Arnaud and E. Memin. Partial linear gaussian models for tracking inimage sequences using sequential monte carlo methods. Int. J. Comput.Vision, 74(1):75–102, 2007.

[2] S. Baker, A. Datta, and T. Kanade. Parameterizing homographies.Technical report, The Robotics Instutitue, Carnegie Mellon University,2006.

[3] S. Baker and I. Matthews. Lucas-kanade 20 years on: a unifyingframework. Int. J. Comput. Vision, 56(3):221–255, 2004.

[4] C. Bao, Y. Wu, H. Ling, and H. Ji. Real time robust l1 tracker usingaccelerated proximal gradient approach. In Proc. CVPR, 2012.

[5] Y. Bar-Shalom, T. Kirubarajan, and X.-R. Li. Estimation with Applica-tions to Tracking and Navigation. John Wiley and Sons, Inc., 2002.

[6] H. Bay, T. Tuytelaars, and L. Gool. Surf: Speeded up robust features.In Proc. ECCV, 2006.

[7] E. Begelfor and M. Werman. How to put probablilities on homographies.IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1666–1670, 2005.

[8] S. Benhimane and E. Malis. Homography-based 2d visual tracking andservoing. Int. J. Rob. Res., 26(7):661–676, 2007.

[9] M. Black and A. Jepson. Eigentracking: robust matching and trackingof articulated obejcts using a view-based representation. Int. J. Comput.Vision, 26(1):63–84, 1998.

[10] M. Bolic, P. Djuric, and S. Hong. New resampling algorithms for particlefilters. In Proc. ICASSP, 2003.

[11] R. Brooks and T. Arbel. Generalizing inverse compositional and esmimage alignment. Int. J. Comput. Vision, 87(3):191–212, 2010.

[12] A. Chiuso and S. Soatto. Monte carlo filtering on lie groups. In Proc.IEEE CDC, pages 304–309, 2000.

[13] F. Dellaert and R. Collins. Fast image-based tracking by selective pixelintegration. In Proc. ICCV Workshop on Frame Rate Processing, 1999.

[14] A. Doucet, N. Freitas, and N. Gordon, editors. Sequential Monte Carlomethods in practice. Springer-Verlag New York, 2001.

[15] A. Doucet, S. Godsill, and C. Andrieu. On sequential monte carlosampling methods for bayesian filtering. Stat. Comput., 10:197–208,2000.

[16] N. Dowson and R. Bowden. Mutual information for lucas-kanadetracking (milk): an inverse compositional formulation. IEEE Trans.Pattern Anal. Mach. Intell., 30(1):180–185, 2008.

[17] U. Grenander. Probabilities on Algebraic Structures. New York: Wiley,1963.

[18] G. Hager and P. Belhumeur. Efficient region tracking with parametricmodels of geometry and illumination. IEEE Trans. Pattern Anal. Mach.Intell., 20(10):1025–1039, 1998.

[19] B. Hall. Lie Groups, Lie Algebras, and Representations: An ElementaryIntroduction. Springer, 2004.

[20] S. Hauberg, F. Lauze, and K. Steenstrup. Unscented kalman filtering onriemannian manifolds. J. Math. Imaging Vision, 46(1):103–120, 2013.

[21] S. Hinterstoisser, O. Kutter, N. Navab, P. Fua, and V. Lepetit. Real-timelearning of accurate patch retification. In Proc. CVPR, 2009.

[22] S. Holzer, S. Ilic, and N. Navab. Adaptive linear predictors for real-timetracking. In Proc. CVPR, 2010.

[23] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang. Singleand multiple object tracking using log-euclidean riemannian subspaceand block-division appearance model. IEEE Trans. Pattern Anal. Mach.Intell., In press.

[24] M. Isard and A. Blake. Condensation–conditional density propagationfor visual tracking. Int. J. Comput. Vision, 29:5–28, 1998.

[25] S. Julier and J. Uhlmann. Unscented filtering and nonlinear estimation.Proc. IEEE, 92(3):401–422, 2004.

[26] F. Jurie and M. Dhome. Hyperplane approximation for templatematching. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):996–1000,2002.

[27] F. Jurie and M. Dhome. Real time robust template matching. In Proc.BMVC, 2002.

[28] J. Kwon, M. Choi, F. Park, and C. Chun. Particle filtering on theeuclidean group: framework and applications. Robotica, 25(6):725–737,2007.

[29] J. Kwon and K. Lee. Monocular slam with locally planar landmarksvia geometric rao-blackwellized particle filtering on lie groups. In Proc.CVPR, 2010.

[30] J. Kwon, K. Lee, and F. Park. Visual tracking via geometric particlefiltering on the affine group with optimal importance functions. In Proc.CVPR, 2009.

17

[31] J. Kwon and F. Park. Visual tracking via particle filtering on the affinegroup. Int. J. Rob. Res., 29(2-3):198–217, 2010.

[32] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman. Visual tracking andrecognition using probabilistic appearance manifolds. Comput. VisionImage Understanding, 99(3):303–331, 2005.

[33] M. Li, T. Tan, W. Chen, and K. Huang. Efficient object tracking byincremental self-tuning particle filtering on the affine group. IEEE Trans.Image Process., 21(3):1298–1313, 2012.

[34] P. Li, T. Zhang, and A. Pece. Visual contour tracking based on particlefilters. Image Vision Comput., 21:111–123, 2003.

[35] S. Lieberknecht, S. Benhimane, P. Meier, and N. Navab. A dataset andevaluation methodology for template-based tracking algorithms. In Proc.ISMAR, 2009.

[36] D. Lowe. Distinctive image features from scale-invarinat keypoints. Int.J. Comput. Vision, 60(2):91–110, 2004.

[37] B. Lucas and T. Kanade. An iterative image registration technique withan application to stereo vision. In Proc. IJCAI, pages 674–679, 1981.

[38] R. Mahony and J. Manton. The geometry of the newton method onnon-compact lie groups. J. Global Optim., 23:309–327, 2002.

[39] E. Malis. Esm software development kits. fromhttp://esm.gforge.inria.fr/ESM.html.

[40] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in tenlines of code. In Proc. CVPR, 2007.

[41] N. Paragios and R. Deriche. Geodesic active contours and level sets forthe detection and tracking of moving objects. IEEE Trans. Pattern Anal.Mach. Intell., 22(3):266–280, 2000.

[42] X. Pennec. Intrinsic statistics on riemannian manifolds: basic tools forgeometric measurements. J. Math. Imaging Vis., 25:127–154, 2006.

[43] J. Pluim, J. Maintz, and M. Viergever. Mutual-information-basedregistration of medical images: a survey. IEEE Trans. Pattern Anal.Mach. Intell., 22(8):986–1004, 2003.

[44] D. Ponsa and A. Lopez. Variance reduction techniques in particle-basedvisual contour tracking. Pattern Recognit., 42:2372–2391, 2009.

[45] S. Prince, K. Xu, and A. Cheok. Augmented reality camera trackingwith homogrpahies. IEEE Comput. Graphics Appl., 22(6):39–45, 2002.

[46] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning forrobust visual tracking. Int. J. Comput. Vision, 77(1-3):125–141, 2008.

[47] Y. Rui and Y. Chen. Better proposal distributions: object tracking usingunscented particle filter. In Proc. CVPR, 2001.

[48] S. Saha, P. Mandal, Y. Boers, and H. Driessen. Gaussian proposal denstiyusing moment matching in smc. Stat. Comput., 19:203–208, 2009.

[49] C. Shen, A. van den Hengel, A. Dick, and M. Brooks. Enhancedimportance sampling: unscented auxiliary particle filtering for visualtracking. Lect. Notes Artif. Intell., 3339:180–191, 2005.

[50] G. Silveira and E. Malis. Unified direct visual tracking of rigid anddeformable surfaces under generic illumination changes in grayscale andcolor images. Int. J. Comput. Vision, 89(1):84–105, 2010.

[51] G. Silveira, E. Malis, and P. Rives. An efficient direct approach to visualslam. IEEE Trans. Robot., 24(5):969–979, 2008.

[52] A. Srivastava. Bayesian filtering for tracking pose and location of rigidtargets. In Proc. SPIE AeroSense, 2000.

[53] A. Srivastava, U. Grenander, G. Jensen, and M. Miller. Jump-diffusionmarkov processes on orthogonal groups for object pose estimation. J.Stat. Plan. Infer., 103:15–37, 2002.

[54] R. Subbarao and P. Meer. Nonlinear mean shift over riemannianmanifolds. Int. J. Comput.Vision, 84(1):1–20, 2009.

[55] Z. Sun and A. Hoogs. Image comparison by compound disjointinformation with applications to perceptual visual quality assessment,image registration and tracking. Int. J. Comput. Vision, 88(3):461–488,2010.

[56] C. Taylor and D. Kriegman. Minimization on the lie group so(3) andrelated manifolds. Technical Report, 1994.

[57] O. Tuzel, F. Porikli, and P. Meer. Learning on lie groups for invariantdetection and tracking. In Proc. CVPR, 2008.

[58] T. Vercauteren, X. Pennec, E. Malis, A. Perchant, and N. Ayache. Insightinto efficient image registration techniques and the demons algorithm.Lect. Notes Comput. Sci., 4584:495–506, 2008.

[59] H. Wang, D. Suter, K. Schindler, and C. Shen. Adaptive object trackingbased on an effective appearance filter. IEEE Trans. Pattern Anal. Mach.Intell., 29(9):1661–1667, 2007.

[60] Y. Wang and R. Schultz. Tracking occluded targets in high-similaritybackgroud: an online training, non-local appearance model and periodichybrid particle filter with projective transformation. In Proc. ICIP, 2009.

[61] J. Xavier and J. Manton. On the generalization of ar process toriemannian manifolds. In Proc. ICASSP, 2006.

[62] A. Yilmaz, O. Javed, and M. Shah. Object tracking: a survey. ACMComput. Surv., 38(4), 2006.

[63] A. Yilmaz, L. Xin, and M. Shah. Contour-based object tracking withocclusion handling in video acquired using mobile cameras. IEEE Trans.Pattern Anal. Mach. Intell., 26(11):1531–1536, 2004.

[64] S. Zhou, R. Chellappa, and B. Moghaddam. Visual tracking andrecognition using appearance-adaptive models in particle filters. IEEETrans. Image Process., 13(11):1491–1506, 2004.

[65] K. Zimmermann, J. Matas, and T. Svoboda. Tracking by an optimalsequence of linear predictors. IEEE Trans. Pattern Anal. Mach. Intell.,31(4):677–692, 2009.

Junghyun Kwon received the B.S. and Ph.D.degrees in Mechanical and Aerospace Engineer-ing from Seoul National University, Seoul, Korea,in 2002 and 2008, respectively. During his Ph.Dstudy, he worked as a research assistant for theRobotics Laboratory in the School of Mechanicaland Aerospace Engineering at Seoul National Uni-versity. From September 2008 to January 2011,he was a Brain Korea 21 postdoctoral fellow andworked for the Computer Vision Laboratory in theDepartment of Electrical Engineering and Computer

Science at Seoul National University. He is currently with TeleSecuritySciences, Inc., Las Vegas, USA. His research interests are in visual trackingand visual SLAM, novel approaches to image reconstruction for X-rayimaging systems, and automatic threat detection and recognition for aviationsecurity.

Hee Seok Lee received the B.S. degree in ElectricalEngineering form Seoul National University, Seoul,Korea, in 2006. He is currently pursuing the Ph.D.degree in the area of computer vision and workingas a research assistant for the Computer Vision Lab-oratory in the Department of Electrical Engineeringand Computer Science at Seoul National University.His research interests are in object tracking andsimultaneous localization and mapping for mobilerobots.

Frak C. Park received his B.S. in Electrical En-gineering from MIT in 1985, and Ph.D. in AppliedMathematics from Harvard University in 1991. From1991 to 1995 he was assistant professor of mechan-ical and aerospace engineering at the University ofCalifornia, Irvine. Since 1995 he has been professorof mechanical and aerospace engineering at SeoulNational University. From 2009-2012 was an adjunctprofessor in the Department of Interactive Comput-ing at the Georgia Institute of Technology, and in2001-2002 was a visiting professor at the Courant

Institute of Mathematical Sciences at New York University. His researchinterests are in robot mechanics, planning and control, visual tracking, andrelated areas of applied mathematics. In 2007-2008 he was an IEEE Roboticsand Automation Society (RAS) Distinguished Lecturer, and has served assecretary of RAS from 2009-2010 and 2012-2013. He has served as senioreditor of the IEEE Transactions on Robotics, an area editor of the SpringerHandbook of Robotics and Advanced Tracts in Robotics (STAR), and as anassociate editor of the ASME Journal of Mechanisms and Robotics. He is afellow of the IEEE, and incoming editor-in-chief of the IEEE Transactions onRobotics.

18

Kyoung Mu Lee received the B.S. and M.S. degreesin Control and Instrumentation Engineering fromSeoul National University (SNU), Seoul, Korea in1984 and 1986, respectively, and Ph. D. degree inElectrical Engineering from the USC (Universityof Southern California), Los Angeles, California in1993. He has been awarded the Korean GovernmentOverseas Scholarship during his Ph. D. courses.From 1993 to 1994 he was a research associate in theSIPI (Signal and Image Processing Institute) at USC.He was with the Samsung Electronics Co. Ltd. in

Korea as a senior researcher from 1994 to 1995. On August 1995, he joined thedepartment of Electronics and Electrical Eng. of the Hong-Ik University, andworked as an assistant and associate professor. From September 2003, he iswith the Department of Electrical and Computer Engineering at Seoul NationalUniversity as a professor and the director of the Automation Systems andResearch Institute. His primary research interests include object recognition,MRF optimization, tracking, and visual navigation. He is currently servingas an Associate Editor of the IPSJ Trans. on CVA, MVA, the Journal ofHMSP, and IEEE Signal Processing Letters. He also has served as a ProgramChair and Area Chairs of ACCV, CVPR, and ICCV. He is the recipientof Okawa Foundation Research Grant Award in 2006, Honorable MentionAward at the ACCV2007, the Most Influential Paper over the Decade Awardat IAPR MVA2009, and the Outstanding Research Award by the College ofEngineering of SNU in 2010. He is a Distinguished Lecturer of the Asia-Pacific Signal and Information Processing Association (APSIPA) for 2012-2013. He has (co)authored more than 150 publications in refereed journalsand conferences including PAMI, IJCV, CVPR, ICCV and ECCV. Moreinformation can be found on his homepage http://cv.snu.ac.kr/kmlee.

Documents

A Geometric Particle Filter for Template-Based Visual Tracking · EMPLATE-based object tracking is a particular form of visual tracking, in which the objective is to continuously