

Object Detection in Surveillance Video from Dense Trajectories

Mengyao Zhai, Lei Chen, Jinling Li, Mehran Khodabandeh, Greg Mori
Simon Fraser University
Burnaby, BC, Canada

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Detecting objects such as humans or vehicles is a central problem in video surveillance. Myriad standard approaches exist for this problem. At their core, approaches consider either the appearance of people, patterns of their motion, or differences from the background. In this paper we build on dense trajectories, a state-of-the-art approach for describing spatio-temporal patterns in video sequences. We demonstrate an application of dense trajectories to object detection in surveillance video, showing that they can be used to both regress estimates of object locations and accurately classify objects.

1 Introduction

Object detection is a crucial first step in numerous visual surveillance applications. High-level human activity analysis typically builds upon this step. As such, finding robust solutions for detecting and classifying objects in video is a well-studied problem. Standard approaches to the problem include background subtraction, moving point trajectory analysis, and appearance-based methods.

All of these methods have long histories in the computer vision literature. Appearance-based methods are exemplified by the histogram of oriented gradients (HOG) detector [1] and its variants. Background subtraction-based methods (e.g. [2]) find contiguous foreground regions and further classify them into object types. Point trajectory [3] or moving region [4] methods are related, finding groups of points moving together or containing a motion different from the background.

Each of these methods has its shortcomings. Appearance-based methods are sensitive to highly textured background regions and need to handle substantial intra-class variation. Background subtraction methods and moving point/region methods are sensitive to moving clutter regions. Further, there is often ambiguity when distinguishing foreground from background due to pixel-wise similarities between objects and the background.

In this paper we present an application of the recently developed dense trajectories approach [5], combining both moving trajectories and discriminative (moving) appearance-based classification for object detection in surveillance video. The dense trajectories approach has been shown to obtain state-of-the-art performance on standard benchmarks for human activity recognition, particularly for unconstrained internet videos. Here we demonstrate their effectiveness for the tasks of human and vehicle detection in surveillance videos. We focus on detecting only moving objects; note that while stationary objects can be of interest, in a surveillance context objects generally move at some point, and a tracker or knowledge of scene entry/exit points can be used to fill temporal gaps.

The contribution of this paper is the application of these descriptors to the problem of object detection in surveillance video. Our method utilizes dense trajectories in two steps. First, we detect moving regions and estimate object locations by regressing from dense trajectory descriptors to object locations. After forming candidate detections, we then score them by training classifiers upon a dense trajectory bag-of-words representation. We demonstrate empirically that this method can be effective for human and vehicle detection in surveillance video.

2 Previous Work

As noted above, object detection in surveillance video is a well-studied problem. Turaga et al. [6] provide a survey of this literature. Classic approaches mentioned above include appearance histogram-based methods [1]. More recent methods based on refined Haar-like features [7] and deep learning [8] have shown impressive results for single-image person detection.

In this paper we focus on surveillance video. Analyzing the temporal domain should lead to more robust detection algorithms for this static-camera setting. Classic methods exist, such as grouping moving feature points [3]. Similar methods have been explored in the context of crowded videos of people [9]. The aforementioned appearance-based methods have been extended to video, e.g. [10, 11]. However, relatively few recent methods focus on motion patterns. The main focus of this paper is revisiting the idea of trajectory-based object detection and classification using state-of-the-art trajectory descriptors.

3 Motion-based Detection Model

A high-level overview of our method is shown in Fig. 1. First, we follow the dense trajectory pipeline of finding and tracking moving points over short time scales. From these we use regression to predict the positions of objects. We then agglomerate nearby predictions by clustering and generate bounding boxes. Finally, classifiers are trained to discriminate objects of interest from other regions. In the following subsections we provide details of each of these steps.

3.1 Trajectory-aligned Motion Features

We develop an algorithm for detecting moving objects in surveillance video. The first step of our algorithm is to generate a set of candidate object locations. Analogous to Hough transform-type voting methods (e.g. [12]), we do this by generating an initial set of points that can vote for possible object centers via a regression step.

A key consideration is that we would like a large number of such points. In contrast, approaches such


Figure 1. Overview of detection procedure. Our method first computes dense trajectories from an input video. Each dense trajectory votes for an object center via a learned regression model. These votes are aggregated using mean-shift clustering. Finally, bounding boxes are generated from the clusters and then scored using a classifier on dense trajectory features.

as Harris interest points are sparse and will have difficulty covering objects, especially given noise in the subsequent regression step. Our approach is to extract dense moving points from a given video, and use these points to vote for possible object centers. We use the dense trajectory algorithm from Wang et al. [5]. In that algorithm, dense trajectories are obtained by tracking densely sampled image points through multiple frames using optical flow; HOG descriptors are extracted around each point along a trajectory and normalized to produce a feature vector for the corresponding trajectory. Features extracted from a trajectory have been shown to be more robust than features extracted from single points.
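For illustration, the point-tracking step can be sketched as follows. This is a simplified sketch, not the implementation of [5]: it assumes a dense per-frame optical-flow field has already been computed upstream (e.g. by a Farneback-style method), and the function name, array shapes, and flow-field convention are ours.

```python
import numpy as np

def track_points(flows, points):
    """Track sampled points through consecutive frames by following a
    dense optical-flow field, where flows[t][y, x] = (dx, dy).

    points: (N, 2) array of (x, y) starting positions.
    Returns an (N, T+1, 2) array of trajectories over T flow fields.
    """
    traj = [points.astype(float)]
    for flow in flows:
        p = traj[-1]
        # look up the flow vector at each point's (rounded) location
        yi = np.clip(p[:, 1].round().astype(int), 0, flow.shape[0] - 1)
        xi = np.clip(p[:, 0].round().astype(int), 0, flow.shape[1] - 1)
        traj.append(p + flow[yi, xi])
    return np.stack(traj, axis=1)
```

In the full pipeline, descriptors (HOG in our case) would then be extracted in patches around each tracked position.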

We use these densely sampled trajectory points and their descriptors as inputs to regress object positions, as described next.

3.2 Center Prediction Based on Regression

In this section, we describe how possible object center locations are predicted using a regression model, and how to detect objects given these predicted centers.

Let the number of trajectories be N and the length of each trajectory be L. Our goal is to predict the locations of object centers, given a feature vector s_i extracted along trajectory i, where i = 1, 2, ..., N. A regression model is learned whose output is the offset vector o_i = (x_i, y_i) pointing from the points p_{ij} = (x_{ij}, y_{ij}) on a trajectory to possible centers c_{ij} = (u_{ij}, v_{ij}), where j = 1, 2, ..., L. We assume all points on a trajectory share exactly the same offset vector. Thus, the input of the regression model is the feature vector s_i and the output is o_i. The center prediction is computed as:

c_{ij} = p_{ij} + o_i    (1)
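Equation (1) amounts to broadcasting each trajectory's single offset over all of its points. A minimal numpy sketch (the array shapes and names are illustrative, not from the paper):

```python
import numpy as np

def predict_centers(points, offsets):
    """Vote for object centers per Eq. (1): c_ij = p_ij + o_i.

    points:  (N, L, 2) trajectory points p_ij
    offsets: (N, 2) per-trajectory offsets o_i, shared by all L points
             on trajectory i
    Returns an (N, L, 2) array of predicted centers c_ij.
    """
    return points + offsets[:, None, :]

# hypothetical data: 2 trajectories of length 3
points = np.arange(12, dtype=float).reshape(2, 3, 2)
offsets = np.array([[1.0, -1.0], [0.5, 0.5]])
centers = predict_centers(points, offsets)
```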

Many objects have internal symmetries. For example, a car has two front lights and four wheels, so local features can sometimes be very similar while their offset vectors are quite different. This requires a model that, given one feature vector, can output not just one offset vector but several possible offset vectors. Suppose we want to produce M outputs given one input. Intuitively, we could achieve this by clustering the input feature space into M subspaces and training a regression model for each subspace, obtaining M outputs for one input. The problem with doing so is that we cannot capture the relations between pairs of outputs; more accurate results can sometimes be generated if the relations between outputs are also modelled. To achieve this, we use the dependent Gaussian process model [13], where the relations between the outputs are modelled by adding a noise source which influences all outputs.

Suppose we want to produce M offset outputs Y_k(s), where s \in R^p is a dense trajectory feature and k = 1, 2, ..., M, and for each output we have N_k training observations. Given M datasets D_k = \{s_{ki}, o_{ki}\}_{i=1}^{N_k}, we want to learn a model from the combined data D = \{D_1, D_2, ..., D_M\} to predict (Y_1(s'), Y_2(s'), ..., Y_M(s')) for an input s' \in R^p. Each output is modelled as a sum of three Gaussian processes U, V, and W: V is unique to each output, U shares the same noise source across outputs to ensure the outputs are not independent, and W is additive noise. Thus we have Y_k(s) = U_k(s) + V_k(s) + W_k(s). Let the covariance matrix be Cov^Y; then Cov^Y = Cov^U + Cov^V + \sigma^2. Following [13]:

Cov^U_{kk}(d) = \frac{\pi^{p/2} r_k^2}{\sqrt{|A_k|}} \exp\left(-\frac{1}{4} d^T A_k d\right)    (2)

Cov^U_{kk'}(d) = \frac{\pi^{p/2} r_k r_{k'}}{\sqrt{|A_k + A_{k'}|}} \exp\left(-\frac{1}{2} d^T A_k (A_k + A_{k'})^{-1} A_{k'} d\right)    (3)

Cov^V_{kk}(d) = \frac{2\pi^{p/2} w_k^2}{\sqrt{|B_k|}} \exp\left(-\frac{1}{4} d^T B_k d\right)    (4)

where r, w, A, and B are parameters of the Gaussian kernels, and d is the separation between two inputs s_i and s_{i'}. Given the blocks Cov^Y_{kk'}, the covariance matrix C is constructed as:

C = \begin{pmatrix} Cov^Y_{11} & \cdots & Cov^Y_{1M} \\ \vdots & \ddots & \vdots \\ Cov^Y_{M1} & \cdots & Cov^Y_{MM} \end{pmatrix}    (5)

Now we can compute the log-likelihood:

L = -\frac{1}{2} \log|C| - \frac{1}{2} \vec{o}^T C^{-1} \vec{o} - \frac{N}{2} \log(2\pi)    (6)

where C is a function of the parameters \{r, w, A, B, \sigma\}, and the mean \mu of the Gaussian is set to \vec{0} in our algorithm. Learning the model corresponds to maximizing the log-likelihood L; in our algorithm the parameters are learned with gradient descent. The predictive distribution at the k-th output is a Gaussian with mean \hat{\mu} = \vec{q}^T C^{-1} \vec{o} and variance \hat{\sigma} = \kappa - \vec{q}^T C^{-1} \vec{q}, where \kappa = Cov^Y_{kk}(0) and \vec{q} is the vector of covariances between the query s' and every training input, \vec{q} = (Cov^Y_{k1}(s' - s_{11}), ..., Cov^Y_{kM}(s' - s_{M N_M})).
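The predictive equations above can be illustrated with a stripped-down, single-output Gaussian process regressor. This is only a sketch: it drops the multi-output coupling of [13] (the shared process U), fixes the hyperparameters rather than learning them by gradient descent, and uses a squared-exponential kernel of the same exp(-d^T A d / 4) form with A = I / length^2.

```python
import numpy as np

def gp_predict(S, o, s_new, length=1.0, amp=1.0, noise=1e-2):
    """Zero-mean GP regression: mu = q^T C^{-1} o, var = kappa - q^T C^{-1} q.

    S: (N, p) training inputs, o: (N,) training offsets (one coordinate),
    s_new: (p,) query input. Single-output simplification of the paper's
    dependent-GP model; `noise` plays the role of the additive process W.
    """
    def k(a, b):
        d = np.asarray(a, float) - np.asarray(b, float)
        return amp * np.exp(-0.25 * float(d @ d) / length ** 2)

    N = len(S)
    C = np.array([[k(S[i], S[j]) for j in range(N)] for i in range(N)])
    C += noise * np.eye(N)                      # additive noise term
    q = np.array([k(s_new, S[i]) for i in range(N)])
    mu = q @ np.linalg.solve(C, np.asarray(o, float))
    var = k(s_new, s_new) - q @ np.linalg.solve(C, q)
    return mu, var
```

Querying at (or near) a training input recovers its offset with small predictive variance, which is the behaviour the center-voting step relies on.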

3.3 Detection Based on Center Prediction

Given a set of predicted center locations, mean-shift clustering is used to generate object hypotheses. Suppose we have K object hypotheses returned by mean-shift clustering; each moving point is then assigned to


(a) Traffic Dataset
(b) VIRAT Dataset

Figure 2. Precision-Recall Curves. "Moving people" is evaluation only on ground-truth people who are moving; "all people" is the entire set (moving and stationary).

the nearest cluster center. Possible foreground regions of the K objects are the convex hulls of these K clusters, and these convex hulls are the initial object proposals.
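The vote-aggregation step can be sketched with a basic flat-kernel mean-shift (a simplified stand-in; production implementations use smarter seeding, and the bandwidths used in our experiments are given in Section 4):

```python
import numpy as np

def mean_shift(votes, bandwidth, iters=20):
    """Flat-kernel mean-shift: repeatedly move each point to the mean of
    its bandwidth-neighbourhood, then merge converged points into modes
    and assign each original vote to its nearest mode."""
    pts = votes.astype(float).copy()
    for _ in range(iters):
        for i in range(len(pts)):
            near = votes[np.linalg.norm(votes - pts[i], axis=1) < bandwidth]
            pts[i] = near.mean(axis=0)
    modes = []
    for p in pts:  # merge modes closer than half a bandwidth
        if all(np.linalg.norm(p - m) >= bandwidth / 2 for m in modes):
            modes.append(p)
    modes = np.array(modes)
    labels = np.array([int(np.argmin(np.linalg.norm(modes - p, axis=1)))
                       for p in pts])
    return modes, labels
```

Each resulting cluster of trajectory points becomes one object hypothesis.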

As is often the case, objects have shadows, and points belonging to shadows also move together with objects. Beyond this, the generation of moving points may be noisy; e.g. a few points far from real moving regions may be generated. Thus the convex hulls may also cover certain areas of the background. To shrink a convex hull, we compute the tight bounding box which encloses it and divide the bounding box into a grid. Grid cells are pruned if there are too few moving points inside. The convex hull of the remaining grid cells is used as the final bounding box.
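The grid-pruning refinement can be sketched as below (cell size and the minimum point count are illustrative parameters, not values from the paper; for simplicity the sketch returns the tight box of the surviving points rather than their convex hull):

```python
from collections import Counter
import numpy as np

def refine_box(points, cell=8.0, min_pts=3):
    """Shrink a candidate region: grid the tight bounding box of the
    moving points and drop points in sparsely populated cells, which
    removes shadow points and stray outliers before the final box."""
    origin = points.min(axis=0)
    keys = [tuple(k) for k in ((points - origin) // cell).astype(int)]
    counts = Counter(keys)
    kept = np.array([p for p, k in zip(points, keys) if counts[k] >= min_pts])
    x0, y0 = kept.min(axis=0)
    x1, y1 = kept.max(axis=0)
    return x0, y0, x1, y1
```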

3.4 Classification

Given all candidate bounding boxes produced by our method, the final step is to score them according to whether they contain an object of interest. For example, we might be interested in detecting the people in a scene or the vehicles in a scene. We do this by training a classifier based on dense trajectory features.

We consider all dense trajectory features that pass through a candidate bounding box in one frame. Following the standard approach, we vector-quantize dense trajectory features into a bag-of-words representation using 100 words. A discriminative classifier (SVM) is trained from a labeled training data set to classify each candidate bounding box as containing an object of interest or not.
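The quantization step can be sketched as follows. This is illustrative only: the vocabulary (100 words in our experiments) would come from clustering training descriptors, and the SVM trained on the resulting histograms is omitted here.

```python
import numpy as np

def bow_histogram(features, vocab):
    """Quantize a box's trajectory descriptors against a visual
    vocabulary into an L1-normalized bag-of-words histogram.

    features: (n, p) descriptors passing through the box
    vocab:    (W, p) visual-word centers (W = 100 in the paper)
    """
    # squared distance from every descriptor to every word
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    h = np.bincount(words, minlength=len(vocab)).astype(float)
    return h / max(h.sum(), 1.0)
```

Each candidate box yields one such histogram, which is then scored by the trained classifier.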

4 Experiments

We test our model on two object detection problems in surveillance video: vehicle detection and human detection.

4.1 Traffic Dataset

Our model is first tested on a traffic dataset, where the focus is on detecting vehicles. Example frames can be seen in Fig. 3 (top). The training set contains 500 frames of size 480×704 pixels. The test set contains 2258 frames with 2501 vehicles; ground truth is obtained via manual labeling.

For training the regression model, all features extracted from the training set are clustered into 10 subsets to generate 10 outputs. Fifty trajectories from each cluster that pass through ground-truth bounding boxes are randomly picked to form the inputs of the regression model.

In testing, we mark a region of interest corresponding to regions of the video frame where vehicles are of sufficient size. For this dataset, the scale changes are very large: from 250 pixels diagonal down to 20 pixels diagonal. It is difficult to find a single bandwidth that works for entire images: if the bandwidth is too small, vehicles at larger scales will be over-segmented, and if the bandwidth is too large, vehicles at smaller scales will be grouped into one cluster. To address this problem, the image coordinates are manually divided into two regions. For near-field regions, the vehicles have relatively large scale and thus mean-shift clustering is performed in world coordinates with bandwidth 3.5. For far-field regions, mean-shift clustering is performed in image coordinates with bandwidth 20. Note that in a fixed-camera surveillance setting, these parameters could be obtained with camera calibration.

To further improve the quality of the bounding boxes, we refine them in two steps: we merge spatially nearby clusters constructed from dense trajectory points moving in the same direction, and we learn a simple model to adjust bounding box sizes based on their image coordinates. For the merging step, we compute the average length of vehicles in world coordinates, which is 20; if two blobs are close enough and both lengths are less than 10, the two blobs are merged into one. Given the dense trajectory points and features, training takes 4 hours to converge, and at test time every 10K trajectories take 23 minutes to process.

We compare every step of our method against a baseline consisting of background subtraction plus the same classifier used in our approach. We summarize the comparison in the precision-recall curves shown in Fig. 2(a). The red curve corresponds to our final detection result and the blue curve corresponds to the result of background subtraction.

4.2 VIRAT Dataset

The second dataset we use is the VIRAT dataset [14]. We focus on detecting moving people in the video sequences. The VIRAT dataset contains a large amount of static surveillance camera video. We use Scene 0000, a parking lot scene containing people alongside moving background clutter such as vehicles.

We define a training set containing 500 frames with frame size 1080×1920 pixels. Again, for regression all


Figure 3. Visualization of detection results. The top row shows vehicle detection results on the Traffic dataset; the bottom row shows human detection results on the VIRAT dataset. The green bounding boxes are true positives and the red bounding boxes are false positives. For the second row, the first four examples are top-scoring true positives and the last four examples are the top-scoring false positives.

features extracted from the training set are clustered into 10 subsets to generate 10 outputs. Fifty trajectories of 5 people from each cluster going through ground-truth bounding boxes are randomly picked to form the inputs of the regression model.

Our test set contains 1900 frames and has 1571 manually labelled ground-truth person locations. The image coordinates are manually divided into two regions, and mean-shift clustering is performed in these two regions separately. The bandwidths of the near-field and far-field regions are 50 and 25, respectively. Detections with size smaller than a threshold are removed. The two refinement steps used for the traffic dataset are not used on this dataset. We compare our method to baselines of DPM [15], trained on the same positive data and including hard negative mining, and background subtraction. We consider performance both on all people in the test set ("all people") and only those that are moving ("moving people"). The comparisons of final detection results with the baselines are shown in Fig. 2(b), using the standard intersection-over-union > 0.5 criterion. Note that our methods (red and blue curves) achieve higher precision than the baselines. As expected, the static-person DPM detector achieves higher recall for the "all people" setting.
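The evaluation criterion is the standard intersection-over-union overlap; a minimal implementation for axis-aligned boxes in (x0, y0, x1, y1) form:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1); a
    detection counts as correct when iou > 0.5."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```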

5 Conclusion

A detection algorithm based on multi-output Gaussian process regression using dense trajectories is proposed in this paper. The regression method can generate candidate detection bounding boxes superior to a baseline based on background subtraction. This is possible because regression from moving points to possible centers is advantageous in that moving points densely cover the moving objects. Another factor in the success of our method is that the features extracted from trajectories are quite stable. Limitations of our method include that, since it is based on motion, only moving objects can be detected. Objects covering few pixels or overlapping in the image plane also present challenges. In summary, our method demonstrates that dense trajectories are effective for object detection in surveillance video.

References

[1] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.

[2] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," PAMI, vol. 19, no. 7, pp. 780–785, July 1997.

[3] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, "A real-time computer vision system for vehicle tracking and traffic surveillance," Transportation Research Part C, vol. 6, no. 4, pp. 271–288, Aug 1998.

[4] R. Cutler and L. Davis, "Robust real-time periodic motion detection, analysis, and applications," PAMI, vol. 22, no. 8, 2000.

[5] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in CVPR, 2011.

[6] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," TCSVT, October 2008.

[7] S. Zhang, C. Bauckhage, and A. Cremers, "Informed Haar-like features improve pedestrian detection," in CVPR, 2014.

[8] P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable deep network for pedestrian detection," in CVPR, 2014.

[9] V. Rabaud and S. Belongie, "Counting crowded moving objects," in CVPR, 2006.

[10] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in ECCV, 2006.

[11] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in ICCV, pp. 734–741, 2003.

[12] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in CVPR, 2005.

[13] P. Boyle and M. Frean, "Dependent Gaussian processes," in NIPS, 2005.

[14] S. Oh, A. Hoogs, A. Perera, et al., "A large-scale benchmark dataset for event recognition in surveillance video," in CVPR, 2011.

[15] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," PAMI, vol. 32, no. 9, 2010.