

Incremental Learning for Video-based Gait Recognition with LBP-Flow

Maodi Hu, Yunhong Wang, Member, IEEE, Zhaoxiang Zhang, Member, IEEE, De Zhang, and James J. Little, Member, IEEE

Maodi Hu, Yunhong Wang, Zhaoxiang Zhang, and De Zhang are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China, and also with the Laboratory of Intelligent Recognition and Image Processing, School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

James J. Little is with the Laboratory for Computational Intelligence, Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC, Canada V6T 1Z4 (e-mail: [email protected]).

Abstract—Gait analysis provides a feasible approach for identification in intelligent video surveillance. However, the effectiveness of dominant silhouette-based approaches is overly dependent upon background subtraction. In this work, we propose a novel incremental framework based on optical flow, including dynamics learning, pattern retrieval, and recognition. It can greatly improve the usability of gait traits in video surveillance applications. Local Binary Pattern (LBP) is employed to describe the texture information of optical flow. This representation is called LBP-Flow, which performs well as a static representation of gait movement. Dynamics within and among gait stances becomes the key consideration for multi-frame detection and tracking, which is quite different from existing approaches. To simulate the natural way of knowledge acquisition, an individual Hidden Markov Model (iHMM) representing the gait dynamics of a single subject incrementally evolves from a population model that reflects the average motion process of human gait. It is beneficial for both tracking and recognition, and makes the training process of the HMM more robust to noise. Extensive experiments on widely adopted databases have been carried out to show that our proposed approach achieves excellent performance.

Index Terms—Gait recognition, individual HMM, incremental learning, LBP-Flow.

I. INTRODUCTION

GAIT, as a promising unobtrusive biometric, has attracted many researchers in recent years. In intelligent surveillance, the advantage of accessibility at a distance makes gait a promising behaviour characteristic for recognition. The silhouette has been plausibly regarded as the starting point of gait analysis, since gait researchers [1], [2], [3] managed to identify subjects by individual walking styles using silhouette-based methods. However, most related work relies on human silhouettes extracted from background subtraction (tracking-by-detection). Consequently, the accuracy of gait features, and even alignment, may be directly influenced by a cluttered background. Optical flow is a widely used motion feature that can relieve the problems that arise from silhouette extraction.

There have been many other efforts at human motion analysis, especially with optical flow. For example, in order to represent the flow patterns for pedestrian detection [4], [5], [6] and action recognition [7], [8], Histograms of Oriented Gradients (HOG), LBP, and their variations have been successfully applied. The robustness and efficiency can be further enhanced by recent improvements in estimation and matching algorithms for high-accuracy optical flow [9], [10], [11].

Optical flow can provide much richer detail than a silhouette, while holding the information of shape variations. A few pioneering papers have investigated the application of optical flow to gait recognition. Little et al. [12] extracted Fourier phase features of various moments to represent the flow distribution, and identified subjects using Analysis of Variance (ANOVA). Huang et al. [13] took the magnitudes of flow vectors as templates, and transformed them to a low-dimensional canonical space for comparison. Recently, Bashir et al. [14] proposed a histogram-based motion descriptor capturing both motion intensity and motion direction. They fused horizontal and vertical flows to produce a normalized dissimilarity score, and achieved 83.6% on the lateral view (i.e. 90° angle) data of the CASIA Gait Database (Dataset B) [15]. Our work adopts LBP descriptors to represent the flow pattern, and proposes an incremental learning method of HMM for recognition. It achieves much better recognition performance than the existing work.

Incremental learning has been widely employed in many video-based applications, especially face tracking. For example, Ross et al. [16] proposed an updating method based on incremental algorithms for Principal Component Analysis (PCA). They adapted the eigen-basis while tracking to reflect changes of appearance. Most existing work mainly considers the variances of statistical features (and the movement only for prediction), since the motion dynamics seems less useful for matching and recognition in their applications, such as face tracking. However, dynamics modeling is the core of gait analysis. In this work, we attempt to incrementally learn the periodic gait dynamics, and exploit spatio-temporal relationships for both tracking and recognition. Similar to [17], gait dynamics is regarded as the outward manifestation of stance transitions. Unlike existing tracking algorithms like particle filters [18] that depend on the similarities of appearances between frames, this work aims to recover and compare the periodic dynamics based on stance transitions. A time-slice dynamical model such as an HMM is a natural way to describe this periodicity. An incremental learning method for an HMM with Gaussian Mixture Model (GMM) representation (denoted as HMM-GMM afterwards) is proposed, which shows promising accuracy in tracking and recognition experiments.


Fig. 1. The overall framework for the incremental learning process (components: pHMM, iHMM, GMR model, prediction, detection/tracking, and the incremental learning iteration).

In addition, our approach takes advantage of the information at different resolution levels with respect to coarse-to-fine flow computation.

The main contributions of our work are three-fold. Firstly, we propose an incremental learning method for HMM-GMM. Secondly, tracking and modeling with LBP-Flow are processed in a unified framework. Thirdly, the proposed approach achieves excellent accuracy in recognition experiments. The overall framework of the incremental learning process is shown in Figure 1.

The remainder of this paper is organized as follows. Section II briefly presents the generating process of LBP-Flow features. In Section III, the incremental learning method for HMM-GMM is proposed. We use this method to train the individual Hidden Markov Model (iHMM) that represents the gait pattern of each individual. The utilization of the iHMM in detection and tracking is described in Section IV. In Section V, we describe the implementation for gait recognition. Experimental results on the CASIA Gait Database (Dataset B) [15] and the CASIA Gait Database (Dataset A) (formerly the NLPR database) [19] are demonstrated in Section VI. Finally, we draw conclusions in Section VII.

II. MOTION FEATURE

A. Flow Estimation

Optical flow techniques have been successfully applied in segmentation, tracking, disparity measurement, and many other areas. In this work, we employ the high-accuracy estimation method proposed in [20], which is also presented in [21], [22], [11]. The following is a brief description. Optical flow, as a velocity field associated with image changes [10], can be estimated based on the assumptions of gray value constancy (1), gradient constancy (2), and flow smoothness (3).

$$I(x, y, t) - I(x+u, y+v, t+1) = 0, \qquad (1)$$

$$\nabla I(x, y, t) - \nabla I(x+u, y+v, t+1) = 0, \qquad (2)$$

$$\Psi(|\nabla u|^2 + |\nabla v|^2) = 0, \qquad (3)$$

where t is the index of the current frame, x and y are the coordinates in the t-th frame of the image sequence I, and u and v are the apparent motion (optical flow) from the t-th image to the (t+1)-th one. u and v are also known as the horizontal flow and vertical flow, respectively. An example is shown in Figure 2. The negative part of u denotes flow from right to left, and the negative part of v denotes flow from bottom to top, while the positive parts denote flow in the opposite directions. Note that the negative and positive parts are divided only for display purposes.

Fig. 2. The (a) t-th and (b) (t+1)-th frames of a gait sequence in the CASIA Gait Database (Dataset B). The corresponding (c) negative and (d) positive parts of the horizontal flow u, the (e) negative and (f) positive parts of the vertical flow v, and (g) the flow energy u² + v². The gray scale indicates the magnitude of the flow.

A weighted integral sum of the three left-side terms with a modified L1 penaliser comprises the energy function for minimization. By means of the Euler-Lagrange equations, u and v can be optimized by fixed-point and Successive Over-Relaxation (SOR) iterations. The details of the numerical iterations can be found in [20], [21]. In order to avoid being trapped in a local minimum, a coarse-to-fine strategy and Gaussian smoothing hopefully bring the optimized flow close to the global minimum. The practical use of the different resolution layers is described later.

B. LBP-Flow Feature

We employ LBP descriptors to encode the flow information. The LBP operator [23] is a local feature descriptor representing discriminative texture, and has been successfully used in various applications, such as face recognition [24]. In this work, the $\mathrm{LBP}^{2}_{8,1}$ feature, which takes 8 sample points with radius 1, is used. The superscript 2 means that the number of 0-1 transitions is no more than 2.

The LBP feature is calculated on u and v respectively. Each histogram is computed in a 30×30 cell. The LBP operator assigns a label to every pixel of a flow field (u or v) by thresholding the neighborhood of each pixel with the center pixel value and considering the result as a binary number. Then, the histogram of the labels can be used as a texture descriptor. The details can be found in [24]. Instead of non-overlapping windows, we skip 15 pixels (half a cell) in every window slide to avoid aliasing. As a result, 5×3 overlapped cells are constructed at a resolution of 90×60, and pattern histograms are built in the 15 cells. These histograms are then concatenated to describe the texture of the entire patch. The LBP features of u and v are concatenated and stored as one long vector. This pattern constraint is called the uniform pattern in [23]. It not only reduces the dimension, but also filters out noise under the assumption of flow smoothness.
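To make the descriptor concrete, the following is a minimal sketch of LBP-Flow, assuming the uniform LBP implementation from scikit-image; the cell size, stride, and bin count follow the text above, but the exact binning of the authors' implementation may differ.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_flow(u, v, cell=30, stride=15, P=8, R=1):
    """Concatenated uniform-LBP histograms over both flow channels.

    u, v: 90x60 horizontal/vertical flow patches (aligned bounding box).
    Returns one long vector: 2 channels x 15 cells x (P+2) uniform bins.
    """
    feats = []
    for field in (u, v):
        # Label every pixel with its uniform LBP code (P+2 distinct labels).
        codes = local_binary_pattern(field, P, R, method='uniform')
        H, W = codes.shape
        # Overlapping 30x30 cells with a half-cell (15 px) stride.
        for top in range(0, H - cell + 1, stride):
            for left in range(0, W - cell + 1, stride):
                block = codes[top:top + cell, left:left + cell]
                hist, _ = np.histogram(block, bins=P + 2,
                                       range=(0, P + 2), density=True)
                feats.append(hist)
    return np.concatenate(feats)
```

As described in Section VI, such vectors would then be projected to a 50-dimensional subspace by PCA before modeling.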

III. INDIVIDUAL HMM

A. Related Work

In gait analysis, stances are used to indicate the periodical latent states over gait cycles. After years of physiological research [25], [26], [27], human gait is widely accepted as an identifiable periodic pattern with several stance phases. An HMM that models the representation within and between states is very suitable for this application.

Below we briefly review the development of incremental learning for an HMM. In addition to the off-line Expectation Maximization (EM) algorithm, i.e. batch-learning Baum-Welch (BW), the parameters of an HMM can also be estimated incrementally, with improved convergence and reduced memory requirements [28]. Krishnamurthy et al. [29] derived on-line EM schemes by using stochastic approximations to maximize the Kullback-Leibler information measure. Stenger et al. [30] proposed the Incremental Baum-Welch (IBW) algorithm, in which the states of an HMM with a single Gaussian model are split using a Minimum Description Length (MDL) criterion to represent background changes. It was further adapted to a discrete model with a new backward procedure based on a one-step lookahead by Florez-Larrahondo et al. [28], which is known as the improved Incremental Baum-Welch algorithm (IBW+). However, little work considers its discriminative capability. For the purpose of learning gait dynamics for recognition, the model representation should be enhanced. We apply the idea of IBW+ to the HMM-GMM.

B. Symbol Notation

Firstly, we clarify the symbol notation of the iHMM, which bears some resemblance to the classical off-line HMM [31]. The feature vector extracted from the t-th frame is denoted $x_t$. There are two kinds of parameters within incremental learning.

The first one, Θ, is responsible for the model representation, which comprises the temporal relationships A and the spatial representations B. Each single stance within a gait cycle is represented by a state in the HMM, and the probability density function (pdf) of each state is modeled by a GMM. For an HMM-GMM consisting of Q states with M Gaussian mixture components, $A = \{a_{ij}\}_{1 \le i \le Q,\, 1 \le j \le Q,\, i \ne j}$ denotes the transition probability from state i to state j, and $B = \{\phi_{ik}, \mu_{ik}, \sigma_{ik}\}_{1 \le i \le Q,\, 1 \le k \le M}$ denotes the mixing coefficient, mean vector, and covariance matrix of component k at state i.

The second kind of parameter is the run-time statistics, which are updated and stored every time a new frame arrives. $b_T(i) = P(q_T = i \mid x_T, \Theta)$ is the pdf of $x_T$ at state i, which indicates the fitness of a single frame for an averaged walking stance. Due to the usage of IBW+, both $b_T(i)$ and $b_{T+1}(i)$ are updated in the T-th iteration.

$$b_T(i) = \sum_{k=1}^{M} \phi_{ik}\, \mathcal{N}(x_T; \mu_{ik}, \sigma_{ik}), \qquad (4)$$

$$b_{T+1}(i) = \sum_{k=1}^{M} \phi_{ik}\, \mathcal{N}(x_{T+1}; \mu_{ik}, \sigma_{ik}). \qquad (5)$$

Note that $b_{T+1}(i)$ in the T-th iteration is different from the one in the (T+1)-th iteration because B is also updated. $c_T(i,k) = P(q_T = i, m_{it} = k \mid x_T, \Theta)$ is the probability of $x_T$ being in component k at state i,

$$c_T(i,k) = \frac{\phi_{ik}\, \mathcal{N}(x_T; \mu_{ik}, \sigma_{ik})}{b_T(i)}, \qquad (6)$$

$\alpha_T(i) = P(x_1, \ldots, x_T, q_T = i \mid \Theta)$ is the forward cumulative probability of being in state i,

$$\alpha_T(i) = \begin{cases} \left(\sum_{j=1}^{Q} \alpha_{T-1}(j)\, a_{ji}\right) b_T(i) & T > 1, \\ b_T(i) & T = 1, \end{cases} \qquad (7)$$

and $\beta_T(i) = P(x_T, x_{T+1}, q_T = i \mid \Theta)$ is the backward one,

$$\beta_T(i) = \sum_{j=1}^{Q} a_{ij}\, b_{T+1}(j). \qquad (8)$$

This backward procedure for β is known as the equation of IBW+ [28], which reduces the training complexity of β in the backward procedure of the BW algorithm for a discrete model from O(n²T) to O(n²). Although it does not improve the global time complexity, the experimental results in [28] show that IBW+ converges faster than BW and IBW. Note that it requires a one-step lookahead in the sequence of observations, and can be seen as an example of a fixed-lag smoothing algorithm [32]. $\gamma_T(i) = P(q_T = i \mid x_1, \ldots, x_{T+1}, \Theta)$ is the probability of being in state i,

$$\gamma_T(i) = \frac{\alpha_T(i)\, \beta_T(i)}{\sum_{i=1}^{Q} \alpha_T(i)\, \beta_T(i)}, \qquad (9)$$

and $\xi_{T-1}(i,j) = P(q_{T-1} = i, q_T = j \mid x_1, \ldots, x_{T+1}, \Theta)$ is the probability of the (T−1)-th frame being in state i and the T-th frame being in state j,

$$\xi_{T-1}(i,j) = \begin{cases} \dfrac{\alpha_{T-1}(i)\, a_{ij}\, b_T(j)\, \beta_T(j)}{\sum_{i=1}^{Q} \sum_{j=1}^{Q} \alpha_{T-1}(i)\, a_{ij}\, b_T(j)\, \beta_T(j)} & T > 1, \\ 0 & T = 1. \end{cases}$$

The estimation of $\xi_{T-1}(i,j)$ is improved by the approximation of $\beta_T$ [28]. We use the parameters of a population HMM (pHMM) [17] to serve as Θ of the iHMM in the 0th iteration. Given the T-th and (T+1)-th frames, $b_T(i)$, $b_{T+1}(i)$, $c_T(i,k)$, $\alpha_T(i)$, $\beta_T(i)$, $\gamma_T(i)$, and $\xi_{T-1}(i,j)$ can be calculated in order, based on the Θ of the (T−1)-th iteration, and the incremental learning of Θ in the T-th iteration is carried out subsequently.
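To make this bookkeeping concrete, here is a minimal numpy/scipy sketch of one step of the run-time statistics in Equations (4) to (9). The array shapes, the full-covariance choice, and the absence of the rescaling needed to avoid underflow on long sequences (cf. [31]) are simplifying assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def runtime_statistics(x_T, x_T1, A, phi, mu, sigma, alpha_prev=None):
    """One IBW+ step: emission pdfs, responsibilities, forward vector,
    one-step-lookahead backward vector, and state posterior (Eqs. 4-9)."""
    Q, M = phi.shape
    pdf = lambda x, i, k: multivariate_normal.pdf(x, mu[i, k], sigma[i, k])
    # Eqs. (4)-(5): GMM emission densities for frames T and T+1.
    b_T  = np.array([sum(phi[i, k] * pdf(x_T,  i, k) for k in range(M))
                     for i in range(Q)])
    b_T1 = np.array([sum(phi[i, k] * pdf(x_T1, i, k) for k in range(M))
                     for i in range(Q)])
    # Eq. (6): per-component responsibility within each state.
    c_T = np.array([[phi[i, k] * pdf(x_T, i, k) / b_T[i] for k in range(M)]
                    for i in range(Q)])
    # Eq. (7): forward recursion (alpha_prev is None on the first frame).
    alpha = b_T if alpha_prev is None else (alpha_prev @ A) * b_T
    # Eq. (8): IBW+ backward approximation from the lookahead frame.
    beta = A @ b_T1
    # Eq. (9): state posterior.
    gamma = alpha * beta / np.sum(alpha * beta)
    return b_T, b_T1, c_T, alpha, beta, gamma
```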

C. Incremental Learning for iHMM

If the initial parameter settings are far away from the true values, the errors made at this stage will slow down the convergence process [30]. Therefore, our incremental learning algorithm starts from the pHMM [17], i.e. the averaged dynamical model, whose parameters are estimated using the off-line EM algorithm given the collection of training sequences. With incremental adjustments of the iHMM parameters, the fitness and validity for the specific individual increase as time goes by. This conforms to the natural human style of knowledge acquisition.

Calinon et al. [33], [34] presented a direct update method for a GMM by assuming that the established set of posterior probabilities remains the same when the new data are used to update the model. In our approach, similar updating weights for Gaussian components are integrated into the learning process. However, we take $c_t(i,k)$ to update the set of posterior probabilities in each iteration, which produces a forgetting function like [35] but without explicit intervention. The expected frequency

$$\sum_{t=-N+1}^{T-1} \lambda_t\, \gamma_t(i)\, c_t(i,k) \qquad (10)$$

measures the cardinality of the data belonging to component k at state i, where the constant $N_0$ in $\lambda_t$ controls the deviation rate from the pHMM. The temporally-coherent assumption $\mu^T_{ik} \approx \mu^{T-1}_{ik}$ [36] is employed to deduce the updating rule for the variances. Supposing there are N frames in the training group of the pHMM, we number them $x_{-N+1}, \ldots, x_0$. Given the values of the model parameters estimated from the previous frames, the formulas for the T-th update are shown in Equations (11) to (14).

$$a^T_{ij} = \frac{a^{T-1}_{ij} \left(\sum_{t=-N+1}^{T-2} \lambda_t \gamma_t(i)\right) + \xi_{T-1}(i,j)}{\sum_{t=-N+1}^{T-1} \lambda_t \gamma_t(i)}, \qquad (11)$$

$$\phi^T_{ik} = \frac{\sum_{t=-N+1}^{T} \lambda_t \gamma_t(i)\, c_t(i,k)}{\sum_{t=-N+1}^{T} \lambda_t \gamma_t(i)}, \qquad (12)$$

$$\mu^T_{ik} = \frac{\mu^{T-1}_{ik} \left(\sum_{t=-N+1}^{T-1} \lambda_t \gamma_t(i)\, c_t(i,k)\right) + \gamma_T(i)\, c_T(i,k)\, x_T}{\sum_{t=-N+1}^{T-1} \lambda_t \gamma_t(i)\, c_t(i,k) + \gamma_T(i)\, c_T(i,k)}, \qquad (13)$$

$$\sigma^T_{ik} = \frac{\left(\sigma^{T-1}_{ik} + (\mu^{T-1}_{ik} - \mu^T_{ik})(\mu^{T-1}_{ik} - \mu^T_{ik})^H\right) \sum_{t=-N+1}^{T-1} \lambda_t \gamma_t(i)\, c_t(i,k) + \gamma_T(i)\, c_T(i,k)\, (x_T - \mu^T_{ik})(x_T - \mu^T_{ik})^H}{\sum_{t=-N+1}^{T-1} \lambda_t \gamma_t(i)\, c_t(i,k) + \gamma_T(i)\, c_T(i,k)}, \qquad (14)$$

where

$$\lambda_t = \begin{cases} N_0 / N & t \le 0, \\ 1 & t > 0. \end{cases}$$

Compared to previous studies on incremental HMMs [30], [28], the proposed updating rules make it possible to model the state representations of the HMM by GMMs, which goes beyond the models with a single Gaussian assumption [30].

The time complexity of the incremental learning process is O(mn²T), in which the calculation of $b_t(i)$ takes O(m) time, and the forward and backward procedures find $\alpha_t(i)$ and $\beta_t(i)$ in an extra O(n²T) time. Note that, besides the run-time statistics, only the cumulative sums of $\lambda_t \gamma_t(i)$ and $\lambda_t \gamma_t(i)\, c_t(i,k)$ are calculated in the current update and stored for further learning. It can be considered that the incremental learning process continually updates the necessary sufficient statistics, and discards the original data. Other practical implementation issues of HMMs, including scaling and multiple observation sequences, can be found in [31].
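The update itself reduces to maintaining those cumulative weights. Below is a minimal sketch of Equations (11) to (14); the accumulator layout and in-place updates are our assumptions, not the authors' code.

```python
import numpy as np

def ihmm_update(A, phi, mu, sigma, acc, x_T, gamma_T, c_T, xi_Tm1):
    """One incremental update of Eqs. (11)-(14). Only the model parameters
    and the cumulative weights in `acc` survive between frames:
      S_g2[i]   = sum_{t<=T-2} lambda_t * gamma_t(i)
      S_g1[i]   = sum_{t<=T-1} lambda_t * gamma_t(i)
      S_gc[i,k] = sum_{t<=T-1} lambda_t * gamma_t(i) * c_t(i,k)
    """
    S_g2, S_g1, S_gc = acc
    Q, M = phi.shape
    A = (A * S_g2[:, None] + xi_Tm1) / S_g1[:, None]              # Eq. (11)
    for i in range(Q):
        for k in range(M):
            w = gamma_T[i] * c_T[i, k]                            # new weight
            denom = S_gc[i, k] + w
            mu_new = (S_gc[i, k] * mu[i, k] + w * x_T) / denom    # Eq. (13)
            d_old = mu[i, k] - mu_new
            d_new = x_T - mu_new
            sigma[i, k] = ((sigma[i, k] + np.outer(d_old, d_old)) * S_gc[i, k]
                           + w * np.outer(d_new, d_new)) / denom  # Eq. (14)
            mu[i, k] = mu_new
            S_gc[i, k] = denom
    S_g2, S_g1 = S_g1, S_g1 + gamma_T        # lambda_t = 1 for t > 0
    phi = S_gc / S_g1[:, None]                                    # Eq. (12)
    return A, phi, mu, sigma, (S_g2, S_g1, S_gc)
```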

Fig. 3. Pyramid of flow patterns. The three layers function respectively and complementarily: the first (coarsest) layer L1 for motion detection, the second layer L2 for detection and tracking, and the third layer L3 for recognition.

IV. GAIT PATTERN RETRIEVAL

Gait pattern retrieval is the preprocessing part of recognition, and the silhouette has been the intermediate output in most cases. Because most related work uses data collected in controlled environments, frame-by-frame foreground detection (tracking-by-detection) has been accepted as a plausible approach to segment the human region. However, in a real scenario, occlusion and a cluttered background may severely affect the quality of the gait silhouette extracted by background modeling, and even the temporal alignment. A lot of work has been done to reduce these impacts, such as silhouette refinement after cropping [1], [37]. We aim to improve the accuracy of pattern retrieval during detection and tracking, in which temporal dynamics over gait cycles are continually exploited to extract the motion feature in new frames. A novel detection and tracking approach on the flow field is proposed, resembling traditional pedestrian detection [38], [4] and regression-based tracking.

A. Single-frame Detection

Since most gait analysis work is conducted in controlled environments, background modeling and silhouette extraction have been widely used to obtain the gait patterns. However, they are bound to be vulnerable in complex or changeable application scenarios. In an optical flow field, the moving and motionless regions can be directly distinguished pixel-wise. But optical flow has its own shortcomings, the main one being computational complexity. For the high-accuracy estimation algorithms proposed recently, the computational burden is mostly caused by coarse-to-fine warping. Yet without the step-by-step iterations, the point-wise optimization is likely to fall into local optima, which drives flow estimation towards much poorer results. To relieve this burden, we suggest assigning tasks to different pyramid levels, as shown in Figure 3. In our experiments, this Gaussian pyramid is constructed with a scaling factor of 0.9 and consists of 3 levels in total, considering the resolution of the coarsest level and the accuracy required by recognition.
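A small sketch of such a pyramid follows; the anti-aliasing blur before each resampling is our assumption, since the text only specifies the scaling factor and the number of levels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=3, scale=0.9):
    """Gaussian pyramid used to split the workload across layers;
    returns [L1 (coarsest), ..., L3 (original image)]."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyr[-1], sigma=1.0)  # anti-alias first
        pyr.append(zoom(blurred, scale))
    return pyr[::-1]
```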

Let the original image be the L3 layer. We take the L1 (coarsest) layer to accomplish motion detection. In this layer, the entire flow scene is segmented into several motion patches (L1 patches) and motionless ones. In order to facilitate the extraction of connected components, we form patch regions with rectangular bounding boxes. The motionless patches are discarded, and the regions of the L1 patches are taken into further image warping and flow adjustment.

In the L2 layer, patches containing human-like features are selected. Inspired by [39], the properties of self-clustering and continuity in gait cycles are taken into account during the modeling procedure. Our matching strategy is based on the likelihood of the iHMM, which reflects not only temporal stance transitions but also spatial similarity within each gait stance. The learning process can also be interpreted as an on-line spatio-temporal clustering process. For single-frame detection at first sight, the matching score is the likelihood of a patch being a sample of the iHMM in the 0th iteration (i.e. the pHMM), which is $\sum_{i=1}^{Q} b_1(i)$.

We assume a uniform distribution for the prior probabilities, which means all the stance indexes 1 ≤ i ≤ Q are equally considered in the single-frame detection. Resembling [38], [4], a detection window is scanned across the inside of each L2 patch at all positions and possible scales.

B. Prediction

Considering the similarity within one stance, we construct a prediction model for each stance. The stance index of the t-th frame within a patch sequence is

$$q_t = \arg\max_j \gamma_t(j). \qquad (15)$$

We employ a regression approach called Gaussian Mixture Regression (GMR) [40]. Without loss of generality, the following derivation is for an arbitrary stance i. We propose an incremental learning method for GMR, and integrate it into a unified framework with the iHMM. Considering the relatively stable speed and direction within one stance, we assume a linear movement to represent the motion pattern in the specific time. For the bounding box of the t-th frame, let $C^v_t$ be the vertical center, $C^h_t$ the horizontal center, $h_t$ the height, and $w_t$ the width. We define $y_t$ to indicate the displacement of the bounding box from the t-th frame to the (t+1)-th one:

$$y_t = \left\{ \frac{C^v_{t+1} - C^v_t}{h_t},\; \frac{C^h_{t+1} - C^h_t}{w_t},\; \frac{h_{t+1}}{h_t},\; \frac{w_{t+1}}{w_t} \right\}. \qquad (16)$$
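A tiny helper illustrating Eq. (16); the (Cv, Ch, h, w) tuple convention is an assumption made for this sketch.

```python
def displacement(box_t, box_t1):
    """Scale-invariant displacement y_t of Eq. (16) between two
    bounding boxes given as (Cv, Ch, h, w) center/size tuples."""
    (cv0, ch0, h0, w0), (cv1, ch1, h1, w1) = box_t, box_t1
    return [(cv1 - cv0) / h0, (ch1 - ch0) / w0, h1 / h0, w1 / w0]
```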

This displacement is thus invariant to the current scale, and can be used for regression analysis. Let F be the dimension of the motion feature vector $x_t$ of the t-th frame. Given the set of known variables

$$x = \{x_t : q_t = i\} \in \mathbb{R}^F, \qquad (17)$$

and unknown variables

$$y = \{y_t : q_t = i\} \in \mathbb{R}^4, \qquad (18)$$

a generic regression problem is to estimate the dependent variable $y_T$ of a new data-case $x_T$ when the T-th frame arrives. Note that x is normalized by rescaling. Assume the data follow the joint density

$$P(x, y) = \sum_{k=1}^{K} \phi_k\, \mathcal{N}(x, y; \mu_k, \sigma_k), \qquad (19)$$

where

$$\sum_{k=1}^{K} \phi_k = 1, \qquad \mu_k = \{\mu_{x,k}, \mu_{y,k}\}, \qquad \sigma_k = \begin{pmatrix} \sigma_{x,k} & \sigma_{xy,k} \\ \sigma_{yx,k} & \sigma_{y,k} \end{pmatrix}.$$

$\mu_{x,k}$ and $\sigma_{x,k}$ are obtained directly from state i of the iHMM. Given the probability of $x_T$ being in the k-th component, the expected distribution of $y_T$ for component k is

$$P(y_T \mid x_T, k) = \mathcal{N}(y_T; \hat{\mu}_{y,k}, \hat{\sigma}_{y,k}), \qquad (20)$$

where

$$\hat{\mu}_{y,k} = \mu_{y,k} + \sigma_{yx,k} (\sigma_{x,k})^{-1} (x_T - \mu_{x,k}), \qquad \hat{\sigma}_{y,k} = \sigma_{y,k} - \sigma_{yx,k} (\sigma_{x,k})^{-1} \sigma_{xy,k};$$

$\hat{\mu}_{y,k}$ and $\hat{\sigma}_{y,k}$ are the mean and variance of component k in the conditional probability $P(y_T \mid x_T, k)$ [41]. Therefore, the probability of $y_T$ given the T-th flow patch $x_T$ is

$$P(y_T \mid x_T) = \sum_{k=1}^{K} c_T(k)\, \mathcal{N}(y_T; \hat{\mu}_{y,k}, \hat{\sigma}_{y,k}), \qquad (21)$$

where $c_T(k) = \phi_k\, \mathcal{N}(x_T; \mu_{ik}, \sigma_{ik}) / b_T(i)$.
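The conditioning in Eqs. (20) and (21) is standard Gaussian algebra; the following sketch shows it with numpy/scipy, under assumed array layouts (the per-component blocks of the joint covariance kept as separate arrays).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_predict(x_T, phi, mu_x, mu_y, s_xx, s_xy, s_yy):
    """Per-component conditional mean/covariance of y_T given x_T
    (Eq. (20)) and the responsibilities c_T(k) weighting Eq. (21)."""
    K = len(phi)
    w = np.array([phi[k] * multivariate_normal.pdf(x_T, mu_x[k], s_xx[k])
                  for k in range(K)])
    c = w / w.sum()                                  # c_T(k)
    mu_hat, sig_hat = [], []
    for k in range(K):
        inv = np.linalg.inv(s_xx[k])
        # sigma_yx = sigma_xy^T for the joint Gaussian.
        mu_hat.append(mu_y[k] + s_xy[k].T @ inv @ (x_T - mu_x[k]))
        sig_hat.append(s_yy[k] - s_xy[k].T @ inv @ s_xy[k])
    return c, np.array(mu_hat), np.array(sig_hat)

def sample_displacements(c, mu_hat, sig_hat, n=10, seed=0):
    """Draw n candidate displacements from P(y_T | x_T) (Eq. (21))."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(c), size=n, p=c)
    return np.array([rng.multivariate_normal(mu_hat[k], sig_hat[k])
                     for k in ks])
```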

Based on the joint Gaussian mixture distribution in Equation (21), the possible displacements $y_T$ are produced by sampling from $P(y_T \mid x_T)$. All the parameters, including $\phi_k$, $\mu_k$, and $\sigma_k$ for k = 1...K, are initially estimated from the training group of the pHMM, and updated incrementally to adapt to the individual case. The model parameters $\mu_{x,k}$ and $\sigma_{x,k}$ for k = 1...K are the intersection of the two models. They are updated only once when a new frame arrives. Note that, in our experiments, the parameters of the GMR model are simply initialized from a well-trained pHMM. Given the t-th frame with $q_t = i$ and the GMM representation of stance i in the pHMM, we can estimate the probability $p(k \mid x_t)$ for any Gaussian component k. Due to the underlying relationship between $x_t$ and $y_t$, we assume that

$$p(k \mid x_t) = p(k \mid x_t, y_t). \qquad (22)$$

The calculation of $\phi_k$, $\mu_k$, and $\sigma_k$ for stance i is straightforward. Let $N_i$ be the number of training samples with $q_t = i$.

$$\phi^0_k = \frac{1}{N_i} \sum_{t=1}^{N_i} p(k \mid x_t), \qquad (23)$$

$$\mu^0_k = \frac{\sum_{t=1}^{N_i} [x_t, y_t]\, p(k \mid x_t)}{\sum_{t=1}^{N_i} p(k \mid x_t)}, \qquad (24)$$

$$\sigma^0_k = \frac{\sum_{t=1}^{N_i} p(k \mid x_t)\, ([x_t, y_t] - \mu^0_k)([x_t, y_t] - \mu^0_k)^T}{\sum_{t=1}^{N_i} p(k \mid x_t)}, \qquad (25)$$

where $[\cdot,\cdot]$ denotes vector concatenation, and $\phi^0_k$ is identical to $\phi^0_{ik}$. This trick dispenses with another EM iteration [42] for the initialization of the GMR model, and brings it into correspondence with the iHMM from the beginning. The incremental learning process of the GMR model functions in the same way as Equations (12) to (14), except that only the regression parameters of the current stance index are updated. Note that the iHMM and the GMR model share some run-time statistics. For example, the collection of $\lambda_t \gamma_t(i)$ in Equation (12) is directly used in the incremental learning process of the GMR model.
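As a sketch of this initialization (Eqs. (23) to (25)), assuming the pHMM responsibilities have been precomputed into a matrix:

```python
import numpy as np

def init_gmr_from_phmm(X, Y, resp):
    """Initialize the joint GMR parameters of one stance from pHMM
    responsibilities resp[t, k] = p(k | x_t), without an extra EM pass.
    X: (Ni, F) features; Y: (Ni, 4) displacements."""
    Z = np.hstack([X, Y])                    # joint samples [x_t, y_t]
    phi = resp.mean(axis=0)                  # Eq. (23)
    w = resp / resp.sum(axis=0)              # normalized per-component weights
    mu = w.T @ Z                             # Eq. (24)
    sigma = np.stack([(w[:, k, None] * (Z - mu[k])).T @ (Z - mu[k])
                      for k in range(resp.shape[1])])  # Eq. (25)
    return phi, mu, sigma
```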

C. Multi-frame Detection and Tracking

Frame-by-frame detection is time-consuming and may cause discontinuities in sequence retrieval. In contrast, by depending on the temporal correlations among gait stances, multi-frame detection can be efficient and robust even with low resolution and partial occlusion. Information such as walking speed, direction, and scale is conducive to further tracking. In comparison with the single-frame likelihood, the multi-frame likelihood of a patch sequence $\{x_i\}_{i=1...T}$ is more natural and expressive when estimating the probability for detection. After we obtain the L2 candidates detected when the subject first appears, the forward cumulative probability α can be estimated incrementally. Thus, the probability of the observation sequence $P(x_1, \ldots, x_T \mid \Theta)$ is

$$P(x_1, \ldots, x_T \mid \Theta) = \sum_{i=1}^{Q} \alpha_T(i). \qquad (26)$$

Based on $P(y_T \mid x_T)$ predicted by Equation (21), the locations and scales of the scanning windows for $x_{T+1}$ are produced by sampling. For each possible patch $\tilde{x}_{T+1}$, the probability of the sequence with $\tilde{x}_{T+1}$ is

$$P(x_1, \ldots, x_T, \tilde{x}_{T+1} \mid \Theta) = \sum_{i=1}^{Q} \sum_{j=1}^{Q} \alpha_T(i)\, a_{ij}\, b_{T+1}(j). \qquad (27)$$

The right-hand side can also be written as $\sum_{i=1}^{Q} \alpha_{T+1}(i)$, or $\sum_{i=1}^{Q} \gamma_T(i) = \sum_{i=1}^{Q} \alpha_T(i)\, \beta_T(i)$.

When more frames arrive, a human-like L2 patch sequence is selected for further L3 refinement if and only if its likelihood is larger than a threshold, which means multi-frame detection has been accomplished. Moreover, multi-frame detection is not necessarily a forward process. One practical suggestion is to iteratively select the patch with the largest $\sum_{i=1}^{Q} \gamma_t(i)$ for 1 ≤ t ≤ T, back and forth. We suggest increasing the number of patch candidates during the multi-frame detection process, since the dynamics and direction information has not yet been completely learned.
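Scoring a candidate patch thus reuses the forward recursion. A minimal sketch of Eqs. (26) and (27) follows; in practice the forward vector would be rescaled to avoid underflow (cf. [31]).

```python
import numpy as np

def sequence_likelihood(alpha):
    """Eq. (26): P(x_1..x_T | Theta) from the current forward vector."""
    return np.sum(alpha)

def score_candidate(alpha, A, b_next):
    """Eq. (27): likelihood of extending the sequence by a candidate patch
    with emission vector b_next = [b_{T+1}(j)]_j. Also returns the new
    forward vector, so an accepted candidate continues the recursion."""
    alpha_next = (alpha @ A) * b_next
    return np.sum(alpha_next), alpha_next
```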

Note that recognition performance is bound to decrease if the view angle varies greatly. We employ the intuitive idea proposed in [43]: the view angle (walking direction) is estimated before recognition by a maximum likelihood decision over the iHMMs of all possible views. At the end of multi-frame detection, the view is determined by

$$\arg\max_{view} P(x \mid \Theta_{view}) \qquad (28)$$

and becomes fixed. In our experiments, we use a simple strategy afterwards, which is to track and retrieve the following L2 patches until the likelihood falls below a low threshold, i.e. the subject disappears from the current scene. The prediction and matching strategy for tracking is similar to multi-frame detection, except with a fixed view. In practice, if the walking direction changes, α is retained as an augend for further processing. Color and other information could also be employed in tracking. In addition, the turning action may assist the final identification, but that is beyond the scope of this article.

V. RECOGNITION

Recognition approaches can be divided into two categories. The first (model-based) uses the gallery set to train statistical models for all the individuals, and compares the likelihoods with the probe data. The second (exemplar-based) directly uses the distance between gallery samples and probe ones.

A. Model-based Recognition

The recognition approach based on the iHMM is straightforward. Let $\Theta_{id}$ denote the iHMM trained on the gallery set of subject id. Given the data-case probe, the recognition process can be simply solved by the Maximum A Posteriori (MAP) rule,

$$\arg\max_{id} P(probe \mid \Theta_{id}), \qquad (29)$$

where $P(probe \mid \Theta_{id})$ is the probability of the observation sequence probe given $\Theta_{id}$. Its calculation shares the same form as Equation (26), in which the major computational burden is caused by the forward procedure.
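For illustration, a minimal MAP decision over per-subject models; forward_loglik is a hypothetical helper that runs the forward recursion of Eq. (26) in the log domain.

```python
def recognize(probe_frames, models, forward_loglik):
    """Eq. (29): return the subject id whose iHMM assigns the probe
    sequence the highest likelihood."""
    scores = {sid: forward_loglik(probe_frames, theta)
              for sid, theta in models.items()}
    return max(scores, key=scores.get)
```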

Moreover, we compare the incrementally learned iHMM with the classical off-line HMM (oHMM) [31]. In general, with inadequate training samples, statistical modeling tends to overfit when a relatively large parametric space is required. For example, in the CASIA Gait Database (Dataset B), 6 video sequences covering about 12 gait cycles are captured for each subject. Given any subject, the oHMM converges with a high confidence level after off-line EM training on the first 5 sequences (about 10 gait cycles). Nevertheless, the oHMM produces zero likelihood on the last sequence. The small sample size leads to overfitting of the relatively complicated model, and thereby weak generalization ability. However, higher model complexity is required for multi-class classification (recognition), especially on a large dataset. Therefore, we imitate the function of the deviation rate from incremental learning. Firstly, a population set is created by randomly selecting $N_1$ sequences from the training group. Then, for each subject, we take the union of the 5 gallery sequences and the population set into off-line EM training. Note that, due to the randomness of the population set, the experimental results marked as oHMM are the averages of five tests with different random selections.

B. Exemplar-based Recognition

In addition, two exemplar-based recognition methods are compared. The first compares the similarity between the averages of different sequences, which we abbreviate as AVG. The second is the well-known Dynamic Time Warping (DTW). The recognition results are based on the distance between the gallery set and the probe one:

$$\arg\min_{id} \sum_{si=1}^{5} d(probe, gallery^{id}_{si}), \qquad (30)$$

where $d \in \{d_{AVG}, d_{DTW}\}$.

$d_{AVG}$ measures the Euclidean distance between the temporal averages of two sequences, denoted $S_1$ and $S_2$. Let $T_1$ and $T_2$ denote the lengths of $S_1$ and $S_2$ respectively.

$$d_{AVG}(S_1, S_2) = \left\| \frac{1}{T_1} \sum_{t_1=1}^{T_1} S_1(t_1) - \frac{1}{T_2} \sum_{t_2=1}^{T_2} S_2(t_2) \right\|. \qquad (31)$$

DTW, as a classical technique to measure the similarity between two sequences, has been successfully used for speech [44], gait [45], signatures [46], etc. It is claimed [47] to be the best known solution for time series problems in a variety of domains. The key idea is to find an optimal warping function by step-by-step matching. Despite its generally acknowledged accuracy, the O(T²) time complexity is one of its chief drawbacks. Due to the non-commutative property, we define the distance metric $d_{DTW}$ [48] from the bidirectional distances, each normalized by the warped length:

$$d_{DTW}(S_1, S_2) = DTW(S_1, S_2) + DTW(S_2, S_1). \qquad (32)$$
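For reference, a plain dynamic-programming sketch of both exemplar distances; the path-length normalization inside dtw is our assumption, since the text only says the distances are normalized by the warped length.

```python
import numpy as np

def dtw(S1, S2):
    """O(T1*T2) DTW cost between two feature sequences (rows = frames),
    normalized by a rough warped length."""
    T1, T2 = len(S1), len(S2)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(S1[i - 1] - S2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2] / (T1 + T2)

def d_dtw(S1, S2):
    """Bidirectional distance of Eq. (32)."""
    return dtw(S1, S2) + dtw(S2, S1)

def d_avg(S1, S2):
    """Eq. (31): Euclidean distance between temporal averages."""
    return np.linalg.norm(np.mean(S1, axis=0) - np.mean(S2, axis=0))
```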

Note that, for exemplar-based recognition, all the gallery samples need to be retained. Consequently, the number of comparisons increases with the size of the gallery set. Therefore, in terms of time and space expenditure, model-based methods have clear advantages over exemplar-based ones in the test procedure.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

Following most work in gait recognition, the CASIA Gait Database (Dataset B) [15] and the CASIA Gait Database (Dataset A) (formerly the NLPR database) [19] are chosen for our experiments. All of these experiments are based on the gait patterns retrieved by the aforementioned detection and tracking strategies. To compare the performance of LBP-Flow based methods with silhouette-based ones, the results of the GEI+PCA+LDA method [49] are given. However, the baseline method we use is slightly different from the one presented in [49]. Gait period detection has not been adopted, since it is hard to perform with limited frames, frontal views, and noisy images, all of which are covered in our experiments. Note that the baseline method requires background modeling (or a background image) to crop the silhouette, and employs supervised learning with known labels of the subjects in the training group. In our method, neither background nor subject label is used. As the starting point of the iHMM, a pHMM is learned from the training group to model the average gait motion, which does not require any label representing the identity or number of the subjects. Moreover, optical flow relieves the baseline method's requirement of background subtraction.

LBP-Flow is denoted as LF hereafter. The feature vector in each experiment is projected into a 50-dimensional subspace by PCA. The values of the parameters used in our experiments are as follows: M = 6, Q = 4, N0 = 1000, and N1 = 80. The parameter tuning process is discussed later. In all the experiments, we only use the sequence numbered last as the probe set for each subject.

A. Indoor Experiments

In the CASIA Gait Database (Dataset B) [15], indoor gait data from 11 views are captured. The three most important influence factors, namely view angle, clothing, and carrying condition changes, are separately considered. There are 124 subjects (93 males and 31 females) in Dataset B. The dataset is divided into two groups. The training group contains 40 subjects (the first 20 males and the first 20 females) for the training of the pHMM and the initialization of the GMR model. The test group contains the remaining 84 subjects for performance evaluation. Several video sequences for each influence factor are captured for each subject. The test group is further divided into a gallery set and a probe set. All the flow patches are aligned and resized into 90×60 bounding boxes according to their geometric centers, which provides the training and test processes with a stable scale, especially for the views close to 0° or 180°.

1) Recognition With Limited Frames: Figure 4 shows the recognition performance with limited gallery frames of normal gait patterns. For each subject, an increasing number of frames from the first sequence serves as the gallery set, and the last sequence is used as the probe set. In this dataset, almost all the sequences are longer than 20 frames. However, if the number of frames required exceeds the sequence length, we simply use the entire sequence.

It can be seen from Figure 4 that LF+iHMM outperforms the other methods in most cases. Although the accuracy of the baseline method is slightly better than LF+iHMM at the beginning of the tests with the 18° view sequences, the iHMM exploits the available information to a much greater degree as more frames arrive. Moreover, the iHMM is trained progressively and incrementally, which means former frames can be removed from memory when a new one arrives. In summary, the iHMM not only facilitates the training and test processes, but also greatly improves the recognition performance.

2) Recognition with Influence Factors: We conduct cross-recognition experiments on three kinds of gait patterns, i.e. nm (gait patterns without any influence factor), bg (carrying bags), and cl (wearing big clothes). Table I shows the recognition performance in the side view (90°) with and without influence factors, where the best result in each test is highlighted in bold and the worst is marked with w. We use nm-bg to indicate the recognition experiment in which nm serves as the gallery set and bg serves as the probe set, and so on and so forth. In the recognition experiments with the same gait pattern for the gallery and probe sets (i.e. nm-nm, bg-bg, cl-cl), we take the first 5/1/1 nm/bg/cl sequence(s) as the gallery set, and take the last sequence as the probe set. But in the other experiments, all the sequences of the gallery gait pattern are used as the gallery set. For example, in the nm-bg experiment, all 6 nm sequences are taken into the gallery set and only the last bg sequence serves as the probe set.


Fig. 4. Rank 1 recognition performance with increasing numbers of gallery frames, for LF+AVG, LF+DTW, LF+oHMM, LF+iHMM, and the baseline. (a) 18° view, (b) 54° view, (c) 90° view, (d) 126° view, (e) 162° view, (f) 180° view.

TABLE I
RANK 1 RECOGNITION PERFORMANCE (%) WITH AND WITHOUT INFLUENCE FACTORS IN THE 90° VIEW (LF IS USED WITH THE FIRST FOUR METHODS)

Method   AVG    DTW    oHMM   iHMM   Baseline
nm-nm    71.4   61.9w  63.8   94.0   90.5
nm-bg    13.1w  21.4   19.7   45.2   44.1
nm-cl    20.2w  25.0   22.6   42.9   22.6
bg-nm    16.7   31.0   12.3w  45.2   30.9
bg-bg    63.1   17.9   31.8   64.2   3.6w
bg-cl    9.5w   17.9   9.9    25.0   16.7
cl-nm    15.5   27.4   9.9w   36.9   21.4
cl-bg    11.9   22.6   9.1w   22.6   17.9
cl-cl    60.7   0.0w   21.4   57.1   3.6

It can be seen that the performance fluctuates across situations for most methods. For example, the baseline method only reaches 3.6% Rank 1 scores in the bg-bg and cl-cl experiments. But in general, the iHMM with LF overwhelmingly outperforms the others. Note that the training group of the pHMM remains unchanged as the nm sequences of the first 40 subjects.

To give an idea of running speed, the computing times of the training and test processes for bg-nm are shown in Table II. The code runs on a computer equipped with a 2.83 GHz quad-core CPU and 4 GB of memory, and is implemented in 64-bit Matlab R2010a.

TABLE II
COMPUTING TIME (SECONDS) OF BG-NM BY DIFFERENT METHODS, WHEN 40 TRAINING SUBJECTS AND 84 TEST SUBJECTS ARE CONSIDERED

Method    AVG   DTW    oHMM   iHMM
Training  -     -      983.9  371.1
Test      2.2   410.6  121.6  120.3

Fig. 5. Influence of (a) N0 (for the iHMM) and (b) N1 (for the oHMM) on Rank 1 recognition performance in the 90° view, for several (M, Q) settings.

3) nm-nm Recognition in Different Views: The first 5 nm sequences in each view are used as the gallery set, and the last sequence serves as the probe set. The recognition performance for the 11 views is shown in Table III. We attribute the poor performance of DTW to the diversity of sequence lengths and ineffective distance normalization. In other words, since cycle detection is not employed, DTW performs even worse than AVG in most cases.

4) Parameter Tuning: In an iHMM, there are three parameters requiring tuning, i.e. N0, M, and Q. Finding an optimal value in a three-dimensional space accounts for a heavy computational burden. However, the search can be facilitated by the internal correlations. Some results of exploratory experiments are shown in Figure 5(a). It can be clearly seen that the optimal value of N0 is not sensitive to changes of M and Q, which is consistent with reason and common sense. Through explorations with several random values of M and Q, we find that the assignment N0 = 1000 dominates in almost all cases, which makes the estimation of M and Q much easier. Some intermediate results are shown in Table IV. Note that an iHMM with optimal values of M and Q is unlikely to give the best performance if an inappropriate value is assigned to N0. The parameter tuning process of N1 for the oHMM is similar, and we find that the oHMM achieves its best performance with N1 = 80, as shown in Figure 5(b).

5) Prediction and Tracking: Let Rectint(A, B) denote the area of intersection of two rectangles A and B. We define $Sim(R^b, R^p)$ to represent the accuracy of a predicted location,

$$Sim(R^b, R^p) = \frac{2}{N_s} \sum_{i=1}^{N_s} \frac{Rectint(R^b, R^p_i)}{Area(R^b) + Area(R^p_i)}, \qquad (33)$$

where $R^b$ is the ground-truth bounding box obtained by background subtraction (the background image is provided by the database), $R^p = \{R^p_i\}_{i=1 \cdots N_s}$ are the predicted boxes sampled from $P(y_T \mid x_T)$ in Equation (21), and $N_s$ is the number of sampling points. We set $N_s = 10$ in the following experiments.
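A small sketch of this overlap metric, assuming (x, y, w, h) box tuples and the per-sample averaging reading of Eq. (33):

```python
import numpy as np

def rectint(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def sim(gt, preds):
    """Eq. (33): mean Dice-style overlap between the ground-truth box
    and the Ns sampled predictions."""
    area = lambda r: r[2] * r[3]
    return np.mean([2.0 * rectint(gt, p) / (area(gt) + area(p))
                    for p in preds])
```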


TABLE III
NM-NM RANK 1 RECOGNITION PERFORMANCE (%) IN DIFFERENT VIEWS

View(°)     0     18    36    54    72    90    108   126   144   162   180   Avg.
LF-H+AVG    66.7  72.6  61.9  69.0  66.7  57.1  58.3  56.0  60.7  82.1  65.5  65.2
LF-H+DTW    83.3  72.6  59.5  50.0  52.4  50.0  45.2  58.3  61.9  71.4  85.7  62.8
LF-H+oHMM   83.6  75.5  67.4  65.5  64.8  62.9  66.9  66.9  69.8  72.1  81.7  70.6
LF-H+iHMM   98.8  94.0  88.1  88.1  90.5  91.7  89.3  88.1  89.3  94.0  98.8  91.9
LF-V+AVG    86.9  78.6  67.9  69.0  67.9  66.7  59.5  71.4  76.2  82.1  82.1  73.5
LF-V+DTW    83.3  69.0  67.9  65.5  61.9  67.9  69.0  75.0  77.4  88.1  88.1  73.9
LF-V+oHMM   60.7  63.1  59.3  54.5  53.1  53.1  53.8  55.2  56.4  66.2  75.2  59.2
LF-V+iHMM   97.6  97.6  90.5  89.3  89.3  90.5  88.1  92.9  97.6  97.6  98.8  93.6
LF+AVG      82.1  84.5  72.6  85.7  78.6  71.4  67.9  75.0  81.0  88.1  86.9  79.4
LF+DTW      81.0  75.0  63.1  60.7  53.6  61.9  63.1  66.7  72.6  83.3  89.3  70.0
LF+oHMM     83.1  79.3  72.9  70.2  68.1  63.8  71.0  71.0  74.3  79.5  90.5  74.9
LF+iHMM*    98.8  98.8  94.0  94.0  92.9  94.0  94.0  95.2  97.6  98.8  100.0 96.2
Baseline    96.4  92.9  96.4  91.7  90.5  90.5  92.9  90.5  90.5  95.2  95.2  93.0

TABLE IV
INFLUENCE OF THE COMPLEXITY OF THE HMM ON RANK 1 RECOGNITION PERFORMANCE (%) BY LF+IHMM IN THE 90° VIEW WITH N0 = 1000 (ROW HEADER DENOTES THE NUMBER OF MIXTURE COMPONENTS M, AND COLUMN HEADER DENOTES THE NUMBER OF STATES Q)

M \ Q   1     2     3     4     5     6     7     8
1      84.5  84.5  84.5  86.9  85.7  86.9  86.9  88.1
2      84.5  85.7  86.9  88.1  88.1  88.1  88.1  89.3
3      85.7  88.1  88.1  89.3  89.3  89.3  90.5  90.5
4      86.9  89.3  88.1  90.5  88.1  91.7  85.7  88.1
5      89.3  88.1  91.7  91.7  92.9  88.1  89.3  89.3
6      89.3  88.1  92.9  94.0  92.9  85.7  84.5  88.1
7      89.3  90.5  91.7  92.9  89.3  86.9  86.9  86.9
8      85.7  86.9  90.5  86.9  86.9  89.3  82.1  85.7

Taking one subject in the CASIA Gait Database (Dataset B) as an example, the prediction results for the 0°, 72°, and 126° views are shown in Figure 6. It can be clearly seen that our proposed approach achieves acceptable accuracy in most frames. Note that this experiment is conducted without multi-frame detection, based only on single-frame detection in the 1st frame and tracking afterwards. Therefore, the predicted location in the 2nd frame relies on the location and scale detected in the 1st frame and the initial GMR model of the current view angle. Doubling the number of sampling points with doubled variance, or exhaustive neighboring search, is suggested to improve the accuracy in the first few frames, which is part of the multi-frame detection process. The strategy for view angle estimation is described in Section IV-C, and achieves 100% accuracy.

Fig. 6. Instances of the proposed prediction approach. (a)(c) The 2nd, 12th, 22nd, and 32nd frames of the 0° and 72° views, respectively; blue rectangles denote the ground-truth bounding boxes, and red rectangles denote the ones predicted by our proposed approach. (b)(d) The corresponding accuracy curves of the two sequences.

B. Outdoor Experiments

In the CASIA Gait Database (Dataset A) (formerly the NLPR database) [19], gait sequences are captured in an outdoor environment. All subjects walk along a straight-line path at free cadences. The resulting database includes 20 subjects, and two sequences from right to left are used in our experiments. The length of each image sequence varies with the pace of the walker, but the average is about 90 frames, which is much longer than the average sequence length in the CASIA Gait Database (Dataset B).

1) Recognition: The CASIA Gait Database (Dataset A) has been widely used in recent gait recognition work [43], [50], [51], [52], [53]. That work focuses on the postprocessing of silhouette sequences. As we mentioned above, this may improve the recognition accuracy, but the application scope remains narrow. As shown in Table V, compared with the silhouette-based method [53], we achieve the same 100% accuracy. Each time, we select one gallery subject and one probe subject, and take the others to constitute the training group (for the iHMM) or the population set (for the oHMM). The same parameters illustrated above are used. The performance suggests that the optimal values of the parameters are not sensitive to the data.


TABLE V
COMPARISONS ON RECOGNITION PERFORMANCE WITH EXISTING WORK

Method                  Rank 1 (%)   Rank 5 (%)
Cheng et al. 2007 [43]  86           100
Lee et al. 2009 [50]    88.75        -
Hong et al. 2009 [51]   91.25        96.25
Lee et al. 2010 [52]    92.5         96.25
Lee et al. 2011 [53]    100          100
LF+AVG                  70.0         80.0
LF+DTW                  50.0         75.0
LF+oHMM                 95.0         95.0
LF+iHMM                 100.0        100.0
Baseline [49]           75.0         90.0

Fig. 7. Examples of (a)(d) noisy images, (b)(e) foreground images subtracted from the background, and (c)(f) flow energies with different noise densities: (a)(b)(c) ds = 0.1; (d)(e)(f) ds = 0.5. Blue rectangles denote the ground-truth bounding boxes.

To evaluate our approach in a noisy scenario, we further conduct experiments with artificial noise. Firstly, we add salt-and-pepper noise with density ds to the original images, and use a symmetric 5×5 Gaussian lowpass filter with standard deviation 5 to produce the noisy images. Several examples are shown in Figure 7. We can see from Figure 7(b)(e) that it is difficult to crop gait silhouettes from image sequences with dense noise, even with the background image provided by the dataset. The Rank 1 and Rank 5 recognition performance with different noise densities ds are shown in Figure 8. It shows that the proposed LF+iHMM method achieves better recognition performance than the silhouette-based baseline method, and gives encouraging Rank 5 accuracy even on heavily noisy images carrying little information. Note that the silhouettes used for the baseline method are binarized from the foreground images within the bounding box, as shown in Figure 7(b)(e). The bounding boxes selecting the region for the silhouette are obtained from clean images and background subtraction.
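A short sketch of this noise protocol, assuming grayscale uint8 frames; the truncate value is chosen so that scipy's filter uses a 5×5 kernel.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_noisy(img, ds, seed=0):
    """Salt-and-pepper noise of density ds followed by a Gaussian
    lowpass of standard deviation 5 on a ~5x5 support."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(float).copy()
    mask = rng.random(img.shape) < ds
    noisy[mask] = rng.choice([0.0, 255.0], size=int(mask.sum()))
    return gaussian_filter(noisy, sigma=5, truncate=0.4)  # radius 2 -> 5x5
```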

2) Prediction and Tracking: Similar to the prediction and tracking experiments on the indoor dataset, the results on the outdoor dataset are shown in Figure 9. Outdoor backgrounds did not affect our prediction results, because the optical flow [20] is robust under a considerable amount of noise. We have also conducted the prediction and tracking experiments on noisy image sequences, which are shown in Figures 10 and 11. It can be seen that our algorithm gives acceptable predictions even with noisy data. However, unlike motion-based tracking methods such as the Kalman filter and the particle filter [18], our prediction is based on the incrementally learned model and the motion features extracted from the current frame instead of trajectories. If the image noise is so heavy that the LF features are totally different from the model in the GMR, no prediction is given, as shown in Figure 11(c).

Fig. 8. (a) Rank 1 and (b) Rank 5 recognition performance with different noise densities ds, for LF+AVG, LF+DTW, LF+oHMM, LF+iHMM, and the baseline.

Fig. 9. Instance of the proposed prediction approach. (a) The 2nd, 12th, 22nd, and 32nd frames, respectively. Blue rectangles denote the ground-truth bounding boxes, and red rectangles denote the ones predicted by our proposed approach. (b) The corresponding accuracy curve.

VII. CONCLUSIONS AND FUTURE WORK

In this work, a novel incremental framework for gait tracking and recognition is proposed. The widely used gait silhouette is replaced by LBP-Flow, which is also invariant to color changes. Besides shape information, the flow pattern provides much more detail, and proves effective in gait recognition.

Fig. 10. Successful instance of the proposed prediction approach with noisy images, where ds = 0.5. (a)-(c) The 8th, 16th, and 24th frames, respectively. Blue rectangles denote the ground truth bounding boxes, and red rectangles denote the ones predicted by our proposed approach.


Fig. 11. Failed instance of the proposed prediction approach with noisy images, where ds = 0.7. (a)(b) The predictions in the 8th and 12th frames. (c) No prediction in the 16th frame. Blue rectangles denote the ground truth bounding boxes, and red rectangles denote the ones predicted by our proposed approach.

The flow pyramid is designed to reduce the computational time of retrieval and recognition. To accelerate flow computation, we eliminate uninteresting regions in the coarsest layer of the image pyramid. The temporal relationships between gait stances are incrementally learned for detection, tracking, and recognition. The proposed retrieval approach performs well when the camera and the walking direction are fixed. This is pioneering work that can extend the application scenarios of biometric gait analysis. The proposed framework initiates a novel approach to gait dynamics modeling, and achieves promising results on widely used datasets. We plan to conduct further research on a broad variety of data with severe occlusions, transient actions, and moving cameras, such as sports videos.
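As a rough illustration of the coarse-layer pruning mentioned above, the following sketch thresholds the frame difference at the top of a Gaussian image pyramid and maps the surviving region back to full resolution, so that flow is computed only inside it; coarse_roi, levels, and thresh are illustrative assumptions, not the paper's exact procedure, and grayscale frames are assumed.

import numpy as np
import cv2

def coarse_roi(prev_frame, curr_frame, levels=3, thresh=10):
    """Prune uninteresting regions at the coarsest pyramid level:
    down-sample both frames, threshold their absolute difference,
    and return a full-resolution bounding box (x0, y0, x1, y1)
    outside which no optical flow need be computed."""
    a, b = prev_frame, curr_frame
    for _ in range(levels):              # build the Gaussian image pyramid
        a, b = cv2.pyrDown(a), cv2.pyrDown(b)
    moving = cv2.absdiff(a, b) > thresh  # coarse motion mask
    ys, xs = np.nonzero(moving)
    if xs.size == 0:
        return None                      # nothing moves; skip flow entirely
    s = 2 ** levels                      # map coarse coordinates back
    return (int(xs.min()) * s, int(ys.min()) * s,
            (int(xs.max()) + 1) * s, (int(ys.max()) + 1) * s)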

ACKNOWLEDGMENT

This work is funded by the National Basic Research Program of China (No. 2010CB327902), the National Natural Science Foundation of China (No. 61005016, No. 61061130560), the Fundamental Research Funds for the Central Universities, the Innovation Foundation of BUAA for PhD Graduates, and the China Scholarship Council.

REFERENCES

[1] Z. Liu, L. Malave, and S. Sarkar, "Studies on silhouette quality and gait recognition," in Proc. IEEE Comput. Vis. Pattern Recog., 2004.
[2] Z. Liu and S. Sarkar, "Simplest representation yet for gait recognition: Averaged silhouette," in Proc. IEEE/IAPR Int. Conf. Pattern Recog., 2004.
[3] S. Yu, T. Tan, K. Huang, K. Jia, and X. Wu, "A study on gait-based gender classification," IEEE Trans. Image Process., vol. 18, no. 8, pp. 1905–1910, August 2009.
[4] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in Proc. Eur. Conf. Comput. Vis., vol. 3952, May 2006, pp. 428–441.
[5] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. IEEE Int. Conf. Comput. Vis., Sept. 2009, pp. 32–39.
[6] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, "Multi-cue pedestrian classification with partial occlusion handling," in Proc. IEEE Comput. Vis. Pattern Recog., vol. 1, June 2010, pp. 990–997.
[7] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 2003, pp. 726–733.
[8] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, "Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions," in Proc. IEEE Comput. Vis. Pattern Recog., vol. 1, June 2009, pp. 1932–1939.
[9] N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert, "Highly accurate optic flow computation with theoretically justified warping," Int. J. Comput. Vis., vol. 67, no. 2, pp. 141–158, Aug. 2006.
[10] K. R. T. Aires, A. M. Santana, and A. A. D. Medeiros, "Optical flow using color information: Preliminary results," in Proc. ACM Symposium on Applied Comput., 2008, pp. 1607–1611.
[11] T. Brox and J. Malik, "Large displacement optical flow: Descriptor matching in variational motion estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500–513, March 2011.
[12] J. Little and J. Boyd, "Recognizing people by their gait: The shape of motion," Videre: Journal of Computer Vision Research, vol. 1, no. 2, pp. 1–45, 1998.
[13] P. Huang, C. Harris, and M. Nixon, "Human gait recognition in canonical space using temporal templates," IEE Proc. Vis., Image, Signal Process., vol. 146, no. 2, pp. 93–100, Aug. 1999.
[14] K. Bashir, T. Xiang, and S. Gong, "Gait representation using flow fields," in Proc. British Mach. Vis. Conf., 2009.
[15] S. Yu, D. Tan, and T. Tan, "A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition," in Proc. IEEE/IAPR Int. Conf. Pattern Recog., vol. 4, 2006, pp. 441–444.
[16] D. A. Ross, J. Lim, and R.-S. Lin, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, no. 1-3, pp. 125–141, November 2008.
[17] Z. Liu and S. Sarkar, "Improved gait recognition by gait dynamics normalization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 863–876, June 2006.
[18] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, 2004.
[19] L. Wang, T. Tan, H. Ning, and W. Hu, "Silhouette analysis-based gait recognition for human identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1505–1518, December 2003.
[20] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in Proc. Eur. Conf. Comput. Vis., vol. 4, May 2004, pp. 25–36.
[21] T. Brox, "From pixels to regions: Partial differential equations in image analysis," PhD thesis, Saarland University, 2005.
[22] T. Brox, C. Bregler, and J. Malik, "Large displacement optical flow," in Proc. IEEE Comput. Vis. Pattern Recog., vol. 1, June 2009, pp. 41–48.
[23] T. Ojala, M. Pietikainen, and D. Harwood, "A comparative study of texture measures with classification based on feature distributions," Pattern Recog., vol. 29, no. 1, pp. 51–59, January 1996.
[24] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 2037–2041, 2006.
[25] M. P. Murray, A. B. Drought, and R. C. Kory, "Walking patterns of normal men," Journal of Bone and Joint Surgery, vol. 46, no. 2, pp. 335–360, 1964.
[26] M. P. Murray, "Gait as a total pattern of movement," American Journal of Physical Medicine, vol. 46, no. 1, pp. 290–333, Feb 1967.
[27] M. Hildebrand, "Vertebrate locomotion: An introduction. How does an animal's body move itself along?" BioScience, vol. 39, no. 39, pp. 764–765, 1989.
[28] G. Florez-Larrahondo, S. Bridges, and E. A. Hansen, "Incremental estimation of discrete hidden Markov models based on a new backward procedure," in Proc. National Conf. on Artificial Intell., vol. 1, 2005, pp. 758–763.
[29] V. Krishnamurthy and J. B. Moore, "On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure," IEEE Trans. Signal Process., vol. 41, pp. 2557–2573, 1993.
[30] B. Stenger, V. Ramesh, N. Paragios, F. Coetzee, and J. Buhmann, "Topology free hidden Markov models: Application to background modeling," in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, 2001, pp. 294–301.
[31] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[32] J. Moore, "Discrete-time fixed-lag smoothing algorithms," Automatica, vol. 9, pp. 163–173, 1973.
[33] S. Calinon and A. Billard, "Incremental learning of gestures by imitation in a humanoid robot," in Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, March 2007, pp. 255–262.
[34] S. Calinon, Robot Programming by Demonstration: A Probabilistic Approach. EPFL/CRC Press, 2009.
[35] K. Wakabayashi and T. Miura, "Data stream prediction using incremental hidden Markov models," in Data Warehousing and Knowledge Discovery, ser. Lecture Notes in Computer Science, vol. 5691. Springer Berlin/Heidelberg, 2009, pp. 63–74.
[36] O. Arandjelovic and R. Cipolla, "Incremental learning of temporally-coherent Gaussian mixture models," Technical Papers - Society of Manufacturing Engineers, 2006.


[37] Z. Liu and S. Sarkar, "Effect of silhouette quality on hard problems in gait recognition," IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 2, pp. 170–183, April 2005.
[38] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Vis. Pattern Recog., vol. 1, 2005, pp. 886–893.
[39] L. Lee, G. Dalley, and K. Tieu, "Learning pedestrian models for silhouette refinement," in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, 2003, pp. 663–670.
[40] H. G. Sung, "Gaussian mixture regression and classification," PhD thesis, Rice University, 2004.
[41] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. Academic Press, London, 1979.
[42] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, Berkeley, Tech. Rep. ICSI-TR-97-021, 1997.
[43] M. H. Cheng, M. F. Ho, and C.-L. Huang, "Gait analysis for human identification through manifold learning and HMM," Pattern Recog., vol. 41, no. 8, pp. 2541–2553, August 2008.
[44] O. Ghitza, "Auditory nerve representation as a front-end for speech recognition in a noisy environment," Computer Speech and Language, vol. 1, no. 2, pp. 109–130, December 1986.
[45] A. Kale, N. Cuntoor, B. Yegnanarayana, A. Rajagopalan, and R. Chellappa, "Gait analysis for human identification," in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, 2003, pp. 706–714.
[46] M. E. Munich and P. Perona, "Continuous dynamic time warping for translation-invariant curve alignment with applications to signature verification," in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, 1999, pp. 108–115.
[47] C. A. Ratanamahatana and E. Keogh, "Three myths about dynamic time warping data mining," in Proc. SIAM Int. Conf. Data Mining, vol. 21, 2005, pp. 506–510.
[48] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa, "Matching shape sequences in video with applications in human movement analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 12, pp. 1896–1909, December 2005.
[49] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer, "The human ID gait challenge problem: Data sets, performance, and analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 2, pp. 162–177, February 2005.
[50] H. Lee, S. Hong, I. F. Nizami, and E. Kim, "A noise robust gait representation: Motion energy image," Int. J. Control, Autom., Syst., vol. 7, no. 4, pp. 638–643, 2009.
[51] S. Hong, H. Lee, K.-A. Toh, and E. Kim, "Gait recognition using multi-bipolarized contour vector," Int. J. Control, Autom., Syst., vol. 7, no. 5, pp. 799–808, 2009.
[52] B. Lee, S. Hong, H. Lee, and E. Kim, "Multiple views gait recognition using view transformation model based on optimized gait energy image," in Proc. Int. Conf. Ind. Electron. Appl., June 2010, pp. 313–316.
[53] ——, "Regularized eigenspace-based gait recognition system for human identification," in Proc. Int. Conf. Ind. Electron. Appl., June 2011, pp. 1966–1970.

Maodi Hu received the B.E. degree in computer science from Southwest University, Chongqing, China, in 2008. He is currently working toward the Ph.D. degree in the Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing, China.

Since September 2011, he has worked at the Laboratory for Computational Intelligence, University of British Columbia, Vancouver, BC, Canada, as a Visiting Scholar. His research interests include pattern recognition, computer vision, machine learning, and motion analysis.

Yunhong Wang (M'98) received the B.S. degree in electronic engineering from Northwestern Polytechnical University, Xi'an, China, in 1989, and the M.S. and Ph.D. degrees in electronic engineering from the Nanjing University of Science and Technology, Nanjing, China, in 1995 and 1998, respectively.

She worked at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, from 1998 to 2004. Since 2004, she has been a Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China, where she is also the Director of the Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media. Her research interests include biometrics, pattern recognition, computer vision, data fusion, and image processing. She is a member of the IEEE Computer Society.

Zhaoxiang Zhang (M'08) received the B.S. degree in electronic science and technology from the University of Science and Technology of China, Hefei, China, in 2004, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009.

In October 2009, he joined the Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing, as a Faculty Member. His research interests include computer vision, pattern recognition, image processing, and machine learning.

De Zhang received the B.E. degree in electrical engineering from the North China University of Technology, Beijing, China, in 2001, and a second B.E. degree in computer science from Tsinghua University, Beijing, in 2003. He is currently working toward the Ph.D. degree in the Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing.

He worked at the Visualization and Intelligent Systems Laboratory, University of California, Riverside, as a Visiting Scholar from 2009 to 2010. His research interests include computer vision, pattern recognition, and image/video processing.

James J. Little (M'80) received the A.B. degree from Harvard College, Cambridge, MA, in 1972, and the M.Sc. and Ph.D. degrees in computer science from the University of British Columbia (UBC), Vancouver, BC, Canada, in 1980 and 1985, respectively.

From 1985 to 1988, he was a Research Scientist at the MIT Artificial Intelligence Laboratory. Currently, he is a Professor of Computer Science at UBC and Director of the Laboratory for Computational Intelligence. His research interests include computational vision, robotics, and spatiotemporal information systems, with a particular interest in stereo, motion, tracking, mapping, and motion interpretation.