
BOBBY2: Buffer Based Robust High-Speed Object Tracking

Keifer Lee, Jun Jet Tai, Swee King Phang
Taylor’s University, 1 Jalan Taylor’s, 47500, Subang Jaya, Selangor, Malaysia

[email protected], [email protected], [email protected]

Abstract

In this work, a novel high-speed single object tracker that is robust against non-semantic distractor exemplars is introduced, dubbed BOBBY2. It incorporates a novel exemplar buffer module that sparsely caches the target’s appearance across time, enabling it to adapt to potential target deformation. In addition, we demonstrate that the exemplar buffer is capable of providing redundancy in case of unintended target drift, a desirable trait in any mid- to long-term tracking. Even when the buffer is predominantly filled with distractors instead of valid exemplars, BOBBY2 is capable of maintaining a near-optimal level of accuracy. In terms of speed, BOBBY2 utilises a stripped-down AlexNet as feature extractor with 63% fewer parameters than a vanilla AlexNet, and is thus able to run at 85 FPS. An augmented ImageNet-VID dataset was used for training with the one cycle policy, enabling the model to reach convergence with less than 2 epochs’ worth of data. For validation, the model was benchmarked on the GOT-10k dataset and on an additional small, albeit challenging, custom UAV dataset collected with the TU-3 UAV.

1. Introduction

Object tracking is one of the perennial tasks in the domain of computer vision and robotic perception, amongst many other vision tasks, due to the generality of its nature [14, 4, 5, 23, 31].

A particular interest as of late is to perform single object tracking (SOT) tasks on unmanned aerial vehicles (UAVs), especially light-weight multi-rotor units, due to their wide-ranging applicability across industrial domains [2, 18]. However, there are significant challenges facing such efforts in terms of computational efficiency and robustness against confusing non-target instances (herein referred to as distractors). The limited payload and battery capacity of a typical multi-rotor UAV restricts the type of allowable computing device on board to light-weight embedded systems, thereby severely constraining the computing power available. On the other hand, tracking from a mid-flight UAV usually entails a low target-to-image ratio amidst a complex environment - e.g. low-resolution targets of varying scale and aspect ratio, drastic camera movement and occlusion - possibly littered with distractors. Thus, for feasible object tracking to be performed locally on a typical UAV, the tracker would necessarily need to be both computationally efficient and robust in the aforementioned scenarios. This work is a step towards realizing such applications by introducing a novel high-speed and robust SOT model.

Generally, current approaches to object tracking can be segregated into two categories: correlation filter based trackers and fully convolutional trackers. Siamese fully convolutional trackers that learn the task in an end-to-end manner have been shown by Bertinetto et al. [1] to be a simple yet highly competitive alternative to hand-crafted object trackers, setting new state-of-the-art results in tracking capability. However, as with most deep learning models, they are too computationally expensive for on-board embedded systems to run optimally as is. The same efficiency problem persists for subsequent follow-up Siamese convolutional trackers [33, 17, 1, 26]. Thus, work on locally performed SOT for UAVs has so far relied on hand-crafted trackers and correlation filter models in order to meet the real-time demand on a budget [29, 28, 30, 25]. However, even with these more computationally efficient algorithms, both manually designed and learned, the computational overhead is still too high for optimal performance on a low-end embedded device. Correlation filters and other hand-crafted trackers such as LSST [30] and CACFT [29] achieve impressive accuracy and average run-time performance of 41 FPS and 19 FPS respectively. However, such correlation filter models are much more prone to the problems of ghosting and class confusion.

The challenges above form the basis of this study, where a simple yet highly efficient Siamese convolutional SOT is introduced, dubbed BOBBY2. In order to decrease the computational overhead, BOBBY2 utilizes a combination of sparse feature extraction, a lightweight feature extractor, and run-time heuristics to improve run-time FPS significantly compared to its closest counterpart, SiamFC [1].


As for robustness against distractors, a novel buffer-based exemplar module is incorporated into the tracker to perform feature aggregation across time, and artificially generated negative samples were introduced in training to help the model learn better discriminative representations. Thus, BOBBY2 is capable of achieving 85 FPS with the full tracking pipeline, and 700 FPS as a standalone model. In addition, we have also demonstrated the robustness of BOBBY2 against distractors by explicitly inserting confusing distractors as exemplars during tracking on the ImageNet-VID [19] dataset; it is capable of maintaining near-optimal performance even with a distractor-majority collection of exemplars in the buffer. For validation, the tracker is evaluated on the GOT-10k dataset [11] and a small but challenging custom UAV dataset collected with TU-3, a quad-rotor UAV, on the campus parking lot. Another point of note is the training regime used [21, 22] - one cycle learning with super-convergence - enabling the model to be trained to convergence with less than 2 epochs’ worth of data on ImageNet-VID.

The remainder of this paper is laid out in the following manner: exposition of related works in Section 2, discussion of the methodology employed in Section 3, review of the experimental results on accuracy and computational efficiency in Section 4, and a brief concluding discussion in Section 5.

2. Related Works

2.1. Deep Object Tracker

The predominant architecture used by current state-of-the-art single object trackers is the Siamese CNN variant [33, 17, 1, 26, 27]. In this design, two convolutional feature extractors, ϕ1 and ϕ2, are used to ingest an image of a scene x and the target z (or exemplar) respectively, whereby the resulting feature maps are combined before being fed to a discriminative classifier g(X, Z). In practice, it is common to use a single feature extractor to featurize both scene and exemplar in order to obtain a set of comparable feature maps. The intuition behind Siamese trackers can be construed as similarity learning, whereby the classifier tries to find the target patch from an approximately equivalent scene through a set of commonly computed feature maps, f(x, z) = g(ϕ(x), ϕ(z)).
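
To make the formulation concrete, the following is a minimal PyTorch sketch of the generic Siamese matching function f(x, z) = g(ϕ(x), ϕ(z)); the backbone, head, and tensor sizes are illustrative placeholders rather than any particular tracker’s implementation.

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    """Generic Siamese matcher: f(x, z) = g(phi(x), phi(z))."""

    def __init__(self):
        super().__init__()
        # phi: a single shared feature extractor applied to both scene and
        # exemplar, so both branches produce comparable feature maps.
        self.phi = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # g: a discriminative head over the combined (concatenated) feature maps.
        self.g = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 1),  # e.g. a matching / objectness score
        )

    def forward(self, scene, exemplar):
        fx = self.phi(scene)      # phi(x)
        fz = self.phi(exemplar)   # phi(z)
        return self.g(torch.cat([fx, fz], dim=1))  # g(phi(x), phi(z))

# Example: one scene crop and one exemplar crop.
scene = torch.randn(1, 3, 128, 128)
exemplar = torch.randn(1, 3, 128, 128)
score = SiameseMatcher()(scene, exemplar)
```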

Though Siamese networks had already been introduced earlier in other visual tasks [24, 32, 13], they were first applied to the domain of object tracking independently by Bertinetto et al. [1] with SiamFC and Held et al. [8] with GOTURN, setting new state-of-the-art results, and have subsequently spawned other improved variants of the same architecture [33, 17, 26, 27]. Subsequent Siamese trackers have since sought to improve along three general dimensions: accuracy (e.g. EAO, AUC), robustness against distractors, and run-time FPS. Models such as CFNet [26] and SiamRPN [17] introduced correlation filters, which are computationally cheaper, to reduce the overall computation overhead. The former performs online updates to the correlation filters during inference, while the latter poses the problem as a one-shot learning task, computing the filter parameters from the target’s initialization frame only, and performs region proposal during tracking. DaSiamRPN [33] improves on SiamRPN in terms of robustness against distractor instances by leveraging a hard-negative mining scheme during training and custom loss terms to surgically penalize both inter-class and intra-class distractors. This enables DaSiamRPN to achieve significant advantages in long-term tracking scenarios at real-time [15]. Quantitatively, CFNet is capable of obtaining an average of 75 FPS with competitive accuracy on an NVIDIA Titan X GPU, SiamRPN 160 FPS on an NVIDIA GTX1060 GPU, and DaSiamRPN an average of 135 FPS on a powerful device with an NVIDIA Titan X GPU and 48GB RAM. Evidently, though Siamese trackers are simple and robust against distractors, they are still computationally unviable for resource-constrained real-time tracking on UAVs.

2.2. Real-Time Single Object Tracking

As one may surmise, applications such as real-time tracking on UAVs have thus far largely avoided deep learning models and opted for the significantly more computationally efficient hand-crafted trackers, most of them with closed-form solutions [29, 28, 30, 25]. Correlation filters in particular are very popular in such time-sensitive use cases, though they lack the robustness of deep learning models [3, 10, 6]. Yang et al.’s LSST tracker reformulates large least-squares functions into smaller and more tractable forms in the Fourier domain, to be solved using the efficient Recursive Least-Squares (RLS) algorithm [30]. Xue et al. [29] proposed CACFT, a SOT tracking framework utilizing correlation filters with fused feature maps and a conditional sparse exemplar template updating module to dynamically adapt to changing target properties. In [28], Wu et al. proposed FOLT, a framework that utilizes Kalman Filters to estimate the target’s state properties, such as velocity, width and height, with a fast discriminative saliency map generator for tracking. Uzkent and Seo [25] in turn extended the use of kernelized correlation filters (KCF) by ensembling, dubbed EnKCF, for tracking. In their work, a series of KCFs are applied in succession across a series of frames, mediated by a particle filter.

In terms of run-time performance, EnKCF obtains the highest FPS, 378 FPS on average, albeit at the cost of accuracy and robustness. FOLT follows with 141 FPS, LSST with 41 FPS and CACFT with 19 FPS. However, even these highly efficient algorithms may prove to be sub-optimal, as the devices used for the aforementioned benchmarks are still more powerful than a typical embedded-system-level device.


Figure 1. Overview of BOBBY2’s architecture. The dashed box denotes the sparsely activated exemplar feature extractor, activated every ξ frames. The feature extractors for both branches share the same underlying parameters for equivalent representation. The aggregated exemplar features are concatenated along the channels in the buffer (shown with βmax = 4) before being passed to the classifier.

FOLT was implemented on a device with an Intel Xeon W3250 CPU and 16GB RAM in C++, LSST on an Intel i7-6700hq CPU with 4GB RAM, CACFT on a desktop-class Intel i5-7500 CPU with 16GB RAM, and EnKCF on an unspecified desktop-class platform. Though these figures are not directly comparable due to hardware and software differences, they still offer useful insight into the trackers’ relative performance. Another disadvantage of these hand-crafted models is their higher susceptibility to the problems of ghosting and class confusion compared to deep models.

3. Methodology and Procedures

3.1. Overview of BOBBY2

BOBBY2 is an end-to-end learned convolutional Siamese object tracker with an exemplar buffer for feature aggregation. Feature aggregation, as defined in the current work, is the accumulation of features derived from the target across time in order to form a series of spatio-temporally representative exemplars. Figure 1 is a simplified illustration of the network architecture behind BOBBY2.

A pre-trained AlexNet [16] is used as the backbone feature extractor to produce a rich representation of the input scene and exemplars. AlexNet is used instead of more advanced architectures such as VGG [20] and ResNet [7] because it has better computational efficiency than the latter models. In addition, though AlexNet is a shallower model, the resulting features are representative enough for the task of template-matching based object tracking. Common mobile-oriented networks such as SqueezeNet were empirically found to be less optimal in terms of actual run-time FPS: despite having lower parameter counts and storage sizes, these deeper networks have higher memory demand, which is a significant bottleneck for run-time FPS performance.
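
As an illustration of this kind of backbone slimming, the snippet below truncates torchvision’s stock AlexNet feature stack; the exact cut-off point is an assumption for demonstration, as the specific layers retained in BOBBY2 are not enumerated here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Build AlexNet and keep only an early slice of its convolutional stack.
# NOTE: the cut-off below (first 8 modules of `features`) is an illustrative
# assumption; in practice, ImageNet-pretrained weights would be loaded.
alexnet = models.alexnet()
backbone = nn.Sequential(*list(alexnet.features.children())[:8])

# Featurize a dummy 224x224 crop to inspect the resulting feature map shape.
with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))
print("Feature map shape:", tuple(feats.shape))
```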

One of the main novelties of BOBBY2 is the feature buffer module that caches sparse key exemplar frames across time. Scene inputs xi are ingested by the scene feature extractor at every frame, and the exemplar zi only at frames delineated by the hyperparameter ξ. During in-between frames that do not correspond to the ξ series, the tracker behaves as an asymmetrical Siamese network by deactivating the exemplar branch. This is formally defined as

f(xi,Ei) = g(ϕ(xi), ϕ(Ei)) (1)

where g is the template matching and target verification function, ϕ is the feature extractor, and E the exemplar buffer, which in turn is defined as

E = {Z1, Z2, Z3, ..., Zβmax}    (2)

where βmax is the upper-bound size of the exemplar buffer. The exemplar feature instances of the aggregated buffer are interrelated in time with respect to ξ:

frame(Z_{j+1}) = frame(Z_j) + ξ    (3)

Thus, the exemplar feature extractor is sparsely activated at specific intervals, which endows higher computational efficiency as the interval grows. Ideally, the interval ξ is varied depending on the complexity of a given sequence. Sequences with a higher degree and rate of frame-to-frame target variation or with confusing instances necessitate a shorter activation interval, and vice versa, though that may not always be possible. As for the buffer size, a βmax of 4 was empirically found to be a suitable value with acceptable trade-offs in terms of robustness and efficiency.
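
A minimal sketch of how such a ξ-interval exemplar buffer could be maintained is shown below; the deque-based bookkeeping, the placeholder featurize function and the ξ value are illustrative assumptions, not BOBBY2’s actual implementation.

```python
from collections import deque

import torch

BETA_MAX = 4   # buffer upper bound beta_max from the paper
XI = 10        # exemplar refresh interval xi (hypothetical value for illustration)

def featurize(crop):
    # Stand-in for the shared feature extractor phi; returns a dummy feature map.
    return torch.randn(1, 256, 6, 6)

# E = {Z_1, ..., Z_{beta_max}}; the oldest exemplar features are evicted when full.
exemplar_buffer = deque(maxlen=BETA_MAX)

def track_step(frame_idx, scene_crop, exemplar_crop):
    # The exemplar branch is only activated on frames delineated by xi, so that
    # frame(Z_{j+1}) = frame(Z_j) + xi; on all other frames the tracker behaves
    # as an asymmetric Siamese network and reuses the cached exemplar features.
    if frame_idx % XI == 0:
        exemplar_buffer.append(featurize(exemplar_crop))
    scene_features = featurize(scene_crop)
    # Aggregated exemplar features are concatenated along the channel axis
    # before being passed to the classifier g (cf. Figure 1).
    exemplar_features = torch.cat(list(exemplar_buffer), dim=1)
    return scene_features, exemplar_features

for i in range(3 * XI + 1):
    scene_feat, ex_feat = track_step(i, scene_crop=None, exemplar_crop=None)
print("Buffered exemplar channels:", ex_feat.shape[1])  # grows up to 4 * 256
```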

3.2. Scene-Objectness Classification

BOBBY2 poses tracking as a multitask problem by jointly learning to determine the bounding box coordinates of the target for localization, and a scene-objectness score to verify the target’s presence in the scene itself:

Page 4: BOBBY2: Buffer Based Robust High-Speed Object Tracking · Filters to estimate the target’s state properties such as ve-locity, width and height with a fast discriminative saliency

Ljoint = yobjLbbox + αLobj (4)

where yobj is the label for scene-objectness and α the scene-objectness loss multiplier.

For the bounding box loss, the PyTorch Smooth-L1 loss function was used. Thus, if |x_{s,m} − y_{s,m}| < 1, Equation 5 was used as the bounding box loss function, otherwise Equation 6.

Lbbox = (1 / 8M) Σ_{s=1..4} Σ_{m=1..M} (x_{s,m} − y_{s,m})^2    (5)

Lbbox = (1 / 4M) Σ_{s=1..4} Σ_{m=1..M} (|x_{s,m} − y_{s,m}| − 0.5)    (6)

where s are the sides of the bounding box and M the length of the dataset. The scene-objectness loss is simply the cross-entropy loss with 2 classes corresponding to target absence and presence in a scene. The bounding box loss is only enforced if a target is indeed present in a scene (scene-objectness label of 1). Otherwise, the network is only penalized with the amplified classification loss.
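
The sketch below mirrors Equations 4 to 6 in PyTorch: the Smooth-L1 bounding box loss is enforced only on samples whose scene-objectness label is 1, and the 2-class cross-entropy term is scaled by α. The tensor shapes and the toy batch are illustrative.

```python
import torch
import torch.nn.functional as F

ALPHA = 10.0  # scene-objectness loss multiplier alpha (value quoted in Section 3.4)

def joint_loss(pred_boxes, pred_obj_logits, gt_boxes, gt_obj_labels):
    """L_joint = y_obj * L_bbox + alpha * L_obj (Equation 4)."""
    # Scene-objectness loss: 2-class cross-entropy (target absent / present).
    obj_loss = F.cross_entropy(pred_obj_logits, gt_obj_labels)

    # Smooth-L1 bounding-box loss (Equations 5 and 6), enforced only on
    # samples where the target is actually present (y_obj == 1).
    present = gt_obj_labels == 1
    if present.any():
        bbox_loss = F.smooth_l1_loss(pred_boxes[present], gt_boxes[present])
    else:
        bbox_loss = pred_boxes.sum() * 0.0  # keeps the graph differentiable

    return bbox_loss + ALPHA * obj_loss

# Toy batch: 4 samples, 4 box sides each, 2 objectness classes.
pred_boxes = torch.randn(4, 4, requires_grad=True)
pred_obj_logits = torch.randn(4, 2, requires_grad=True)
gt_boxes = torch.randn(4, 4)
gt_obj_labels = torch.tensor([1, 1, 0, 1])
loss = joint_loss(pred_boxes, pred_obj_logits, gt_boxes, gt_obj_labels)
loss.backward()
```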

In order to do that, an additional 378,606 negative samples were generated from the ImageNet-VID dataset [19], constituting approximately 30% of the total samples¹. This is done to encourage the network to learn more discriminative features beyond plain template matching, and to hedge against false positive instances where the target is absent. The combination of a scene-objectness score and an exemplar buffer enables BOBBY2 to offer advantages over trackers on the two different ends of the exemplar template updating spectrum: tracking by greedy exemplar update at every frame, and by initialization-only exemplar. Unlike the former costly class of greedy trackers that are susceptible to problems such as potential target drift from bad predictions, BOBBY2 is much more robust in that regard by having a series of curated exemplars for redundancy. Yet, unlike the latter class of trackers that utilize only the initialization frame as a template, BOBBY2 is capable of adapting to significant target deformation and variance across time with its sparse exemplar updates.
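
The exact negative-generation scripts live in the project repository; the following is one plausible, simplified way such a sample could be produced, by cropping a scene region that does not overlap the annotated target and assigning it a scene-objectness label of 0. The helper names and parameters are assumptions.

```python
import random

def overlaps(a, b):
    # Boxes given as (x1, y1, x2, y2); True if the two boxes intersect at all.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def sample_negative_crop(img_w, img_h, target_box, crop=128, max_tries=50):
    """Pick a crop that misses the target entirely; label it objectness 0."""
    for _ in range(max_tries):
        x1 = random.randint(0, img_w - crop)
        y1 = random.randint(0, img_h - crop)
        cand = (x1, y1, x1 + crop, y1 + crop)
        if not overlaps(cand, target_box):
            return cand, 0  # (crop box, scene-objectness label = target absent)
    return None

print(sample_negative_crop(640, 360, target_box=(300, 100, 380, 220)))
```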

3.3. Approaching Real-Time Performance

BOBBY2 employs a lightweight feature extractor and a heuristical trick of frame-dropping in order to achieve real-time performance. Compared to the vanilla AlexNet, our implementation is much more lightweight, having less than half the parameters despite having 4 times more exemplar information cached in the buffer, as shown in Table 1.

¹ The exact collection of negative samples and the general negative-generating scripts can be found in the project Github repository.

Figure 2. Average Overlap (AO) and Success Rate (SR) with respect to varying η on the GOT-10k dataset. A minimal performance drop is demonstrated when η = 2.

Model      Parameter Count    FLOPS
AlexNet    61.10M             4.29G
BOBBY2     22.23M             4.13G

Table 1. Computation overhead comparison.
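
Parameter figures such as those in Table 1 can be reproduced with a simple count over a model’s tensors; the snippet below does so for torchvision’s vanilla AlexNet (FLOP counting would require an external profiler and is omitted).

```python
import torch.nn as nn
from torchvision import models

def count_params(model: nn.Module) -> int:
    """Total number of learnable parameters in a model."""
    return sum(p.numel() for p in model.parameters())

alexnet = models.alexnet()  # vanilla AlexNet: ~61.10M parameters, as in Table 1
print(f"AlexNet parameters: {count_params(alexnet) / 1e6:.2f}M")
```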

Figure 3. The TU-3 UAV, constructed from a DJI F450 platform with an ELP webcamera and a Raspberry Pi module.

In addition, every frame coinciding with the dropping interval η will be dropped in time-constrained scenarios in order to fulfill the required response time. We show in Figure 2 that even for the most extreme case of η = 2, where every interleaving frame is dropped, our model is robust enough to maintain its optimal tracking accuracy and precision. The approximate speed-up factor (SF) with respect to the dropping interval is given by the following simple equation.

SF = η / (η − 1)    (7)
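
For example, the most aggressive setting of η = 2 drops every other frame and yields SF = 2/(2 − 1) = 2, i.e. roughly a twofold speed-up, whereas a milder η = 4 drops every fourth frame for SF = 4/3 ≈ 1.33.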

Page 5: BOBBY2: Buffer Based Robust High-Speed Object Tracking · Filters to estimate the target’s state properties such as ve-locity, width and height with a fast discriminative saliency

Figure 4. Sample frames from the custom dataset collected with the TU-3 UAV. The target is the man in white, framed by a yellow bounding box.

3.4. Training and Evaluation

The model was trained with positive samples and artificial negative samples from the ImageNet-VID dataset [19]. For training, the Adam optimizer [12] was used in conjunction with a cyclical learning rate in accordance with the one cycle training policy [21]. With that, we were able to train BOBBY2 in just 5 cycles, each comprising a small random subset - 25%, or 352,879 samples - of the training dataset, by leveraging the super-convergence effect introduced by Smith and Topin [22]. The weight-decay value used was 10e-2, with a scene-objectness loss multiplier α of 10, a batch size of 32 and a learning rate of 3e-3.²
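
A minimal sketch of this training recipe, assuming PyTorch’s built-in OneCycleLR scheduler as the cyclical learning-rate policy and the hyperparameters quoted above, is given below; the model, the data and the per-cycle step count are stand-ins.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import OneCycleLR

# Stand-ins for BOBBY2 and its data; only the optimizer/scheduler settings
# below reflect the hyperparameters quoted in Section 3.4.
model = nn.Linear(10, 4)
steps_per_cycle = 352_879 // 32 + 1  # ~25% subset of ImageNet-VID at batch size 32

optimizer = torch.optim.Adam(model.parameters(), lr=3e-3, weight_decay=10e-2)
scheduler = OneCycleLR(optimizer, max_lr=3e-3, total_steps=steps_per_cycle)

for step in range(steps_per_cycle):
    x, y = torch.randn(32, 10), torch.randn(32, 4)   # stand-in for a real batch
    loss = nn.functional.mse_loss(model(x), y)       # stand-in for L_joint
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the one-cycle learning-rate schedule each step
```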

In terms of evaluation, BOBBY2 was benchmarked on the GOT-10k and ImageNet-VID datasets.

² Plots of the losses can be found in Appendix A.

Figure 5. Success plot on the GOT-10k dataset.

For further validation, we have collected a small sample of footage taken from a custom UAV platform codenamed TU-3. It is a UAV platform constructed by our UAV research group, based on a standard DJI F450 platform, customized with an ELP webcamera connected to a Raspberry Pi for image logging (see Figure 3). The footage is highly challenging, with characteristics such as target disappearance, full occlusions, camera warp, target deformation and the presence of confusing instances, as shown in Figure 4.

4. Experiments and Result Evaluations

4.1. Accuracy and Precision

In this section, analysis of the corresponding benchmarks is presented. First, results from the GOT-10k benchmark are presented to provide a comparison of performance in terms of Average Overlap (AO). Thereafter, we turn to a closer analysis of the tracker’s performance in different configurations and scenarios to highlight its robustness, and conclude with a brief review of its performance on the custom UAV dataset to demonstrate BOBBY2’s generalization capability.

Performance on GOT-10k: To evaluate performance, we use the metrics put forth by the dataset authors, Average Overlap (AO) and Success Rate (SR). The former is the mean overlap between the predicted bounding box and the ground truth, while the latter is the fraction of frames with an overlap value higher than 0.5. Figure 5 is the success plot of a collection of prominent SOT models ranked by their area under the curve (AUC) scores. As expected, the top trackers all utilize the convolution operation in one way or another. The top 3 performing trackers comprise BOBBY2 together with the SiamFC2 and SiamFC trackers, the latter two scoring 0.374 and 0.348 respectively.


Figure 6. Visualization of the input stack with 50% distractors in the exemplar buffer. z1 and z2 are positive exemplars while z3 and z4 are distractors.

BOBBY2 outperforms all other trackers in this category, with a 5.3% improvement in AUC over SiamFC2 and 13.2% over SiamFC. Other non-convolutional and non-deep models such as KCF performed significantly worse than those that are. GOTURN, one of the first convolutional Siamese trackers, performed better than later hybrid correlation-filter models such as CFNet. It has to be noted that, unlike the other trackers on the list, our model was not pre-trained on the GOT-10k training set, as it was only trained on ImageNet-VID, demonstrating its transferability across datasets. As for FPS performance, a direct comparison could not be drawn due to the differences in hardware used in the benchmark of each tracker. Therefore the following discussion on FPS serves only to provide a rough comparison for completeness. The fastest tracker by raw FPS is the Circulant Structure Kernel (CSK) [9] tracker with 122 FPS, followed by GOTURN with 109 FPS, and the Kernelized Correlation Filter (KCF) tracker with 88 FPS. Both CSK and KCF were benchmarked on a 16-core CPU only, while GOTURN uses an NVIDIA Titan X GPU. BOBBY2 comes in at 85 FPS on an NVIDIA 2070-Super GPU, significantly faster than its closest counterpart, SiamFC, which posts 24 FPS on an NVIDIA Titan X GPU. The raw speed of BOBBY2 without the tracking pipeline averages 700 FPS on the same platform. Upon closer inspection, the bottleneck in the tracking pipeline was found to be the image cropping-resizing procedure, which accounts for 93% of the total run-time.
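
The two GOT-10k metrics used above follow directly from per-frame overlaps: AO is the mean IoU over a sequence, and SR is the fraction of frames whose IoU exceeds a threshold (0.5 in the definition above). The sketch below, with a simple IoU helper and a toy sequence, is illustrative only.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_overlap(preds, gts):
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)

def success_rate(preds, gts, threshold=0.5):
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in overlaps) / len(overlaps)

# Toy sequence of three frames.
preds = [(10, 10, 50, 50), (12, 12, 52, 52), (80, 80, 120, 120)]
gts   = [(10, 10, 50, 50), (10, 10, 50, 50), (10, 10, 50, 50)]
print(average_overlap(preds, gts), success_rate(preds, gts))
```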

Performance on ImageNet-VID: One particular feature that makes BOBBY2 stand out is its exemplar buffer. This is important for tracking with models that perform dynamic updates of the exemplar, which allows them to adapt to the target’s deformations across time, crucial for long-term tracking. However, this comes at the cost of increased chances of tracking failure on mid to long sequences, whereby any error in the exemplar update will rapidly accumulate and cascade across frames. Another downside of such an adaptive scheme is the increased computation overhead from needing to perform additional feature extraction at certain intervals - the exemplar buffer refresh interval of BOBBY2 is dictated by ξ. We show that our model is able to perform exactly such adaptive updates while being sufficiently robust against distractors in its buffer, hence negating the problem of error accumulation significantly. In addition, our model imposes no additional computational overhead from performing feature extraction for the mid-sequence exemplar updates, by dropping those frames after featurizing without incurring a significant penalty to accuracy, as demonstrated previously in Figure 2. This way, we are able to maintain a consistent computation cost of approximately a single forward pass (assuming no additional frame dropping is performed, η = ∞) at each iteration throughout the tracking process.

We test the aforementioned distractor robustness by performing controlled tracking runs over a range of distractor percentages, whereby the exemplar buffer is explicitly filled with non-semantic distractors instead of valid exemplars. Figure 6 is a visualization of the tracker’s scene input and exemplar buffer state for a single sample during said test. Table 2 is a detailed tabulation of its performance. We conducted the test with distractor percentages of 0% to 75%, and found that the model holds up very well, with only a 1% drop in success rate at 0.5 AO even when the majority of the buffer contents are distractors. Therefore, it has been demonstrated that the exemplar buffer acts as a redundancy in such cases, capable of hedging against target drift due to error accumulation during adaptive tracking.
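
The sketch below illustrates one way such a controlled test could be set up for βmax = 4, where a chosen fraction of the buffer slots is replaced with distractor entries; the helper and its inputs are hypothetical stand-ins, not the actual evaluation harness.

```python
import random

BETA_MAX = 4  # buffer size used in the experiments

def build_test_buffer(valid_exemplars, distractors, distractor_pct):
    """Assemble an exemplar buffer with a given fraction of distractor entries.

    For beta_max = 4, distractor percentages of 0/25/50/75% correspond to
    replacing 0, 1, 2 or 3 of the four exemplar slots with non-semantic
    distractors.
    """
    n_distractors = round(BETA_MAX * distractor_pct / 100)
    buffer = random.sample(valid_exemplars, BETA_MAX - n_distractors)
    buffer += random.sample(distractors, n_distractors)
    random.shuffle(buffer)
    return buffer

buf = build_test_buffer(valid_exemplars=["z1", "z2", "z3", "z4"],
                        distractors=["d1", "d2", "d3"],
                        distractor_pct=50)
print(buf)  # e.g. two valid exemplars and two distractors
```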

Performance on TU-3 UAV: For additional validation, the model was evaluated on a small yet challenging custom UAV dataset. Furthermore, no fine-tuning was performed on this or any other additional dataset, to further demonstrate the transferability and generalization capability of our model. Table 3 is a breakdown of the results. On this particular dataset, the tracker had difficulty providing optimal results, and the drop in SR as the percentage of buffer distractors increases is much more noticeable than in Table 2. The SR drop between 0% and 75% buffer distractors at the AO threshold of 0.25 is approximately 0.65% in the former and 20% in the current dataset.


% Distractor    SR @ 0.25 AO    SR @ 0.50 AO    SR @ 0.75 AO    SR @ 0.90 AO
0%              0.922           0.845           0.663           0.363
25%             0.924           0.851           0.665           0.357
50%             0.920           0.844           0.649           0.350
75%             0.916           0.835           0.619           0.335

Table 2. Average Success Rate of BOBBY2 across different percentages of buffer exemplar distractors.

% Distractor    SR @ 0.25 AO    SR @ 0.50 AO    SR @ 0.75 AO
0%              0.474           0.111           0.010
25%             0.456           0.101           0.005
50%             0.444           0.087           0.003
75%             0.395           0.063           0.003

Table 3. Average Success Rate of BOBBY2 across different percentages of buffer exemplar distractors on the custom UAV dataset.

We attribute this to the significant difference in footage characteristics between the ImageNet-VID training dataset and the custom UAV dataset. We believe that fine-tuning the tracker on these new challenging scenes would offer a non-trivial improvement. It should be noted, however, that despite not being trained at all on said dataset, BOBBY2 still exhibited enduring robustness in its performance, even with a distractor-majority buffer.

5. Conclusion

In this work, a novel buffer based Siamese object tracker is introduced. Specifically, in our tests a buffer size of β = 4 was used. Hence, at each forward pass, the tracker has access to 4 times more exemplar data than a conventional Siamese tracker. Yet, it has 63% fewer parameters than a vanilla AlexNet and a slightly lower number of floating point operations. On the GOT-10k dataset, BOBBY2 manages to outperform all the trackers being benchmarked, with 5.3% and 13.2% AUC improvements over the next two highest performing trackers. The gap between BOBBY2 and non-deep learning trackers is even more significant, as expected. In terms of speed, our tracker is capable of running above real-time requirements at 85 FPS. One of the heuristic tricks used for computation speed improvement is frame-dropping. In that regard, we have also shown that the tracker is capable of maintaining near-optimal accuracy even with an extreme frame-dropping regime where every interleaved frame is dropped, η = 2. As for robustness, BOBBY2 has a significant advantage over other adaptive trackers in dealing with the problem of potential target drift due to error accumulation in exemplar updates. We have shown that it can once again maintain near-optimal accuracy in such scenarios, even when 75% of the exemplars in the buffer are non-semantic distractors. For a final validation, BOBBY2 was evaluated on a challenging custom UAV dataset collected with the TU-3 UAV, without fine-tuning.

The performance on said dataset is sub-optimal - an SR of 0.474 at 0.25 AO without exemplar distractors - due to its high difficulty and the significant difference between it and the training dataset. Despite that, the performance is sufficient to demonstrate the generalizability of BOBBY2 across data of different distributions; this could be non-trivially improved with additional fine-tuning.

The current work deals only with the domain of SOT, with minimal long-term tracking capability. Thus, we look to extend BOBBY2 to account for multi-object tracking and long-term object tracking cases in our future work. We conjecture that the exemplar buffer will offer a non-trivial advantage in these domains as well, due to its adaptive yet robust buffered embeddings. Another concern is to improve the run-time FPS by overcoming the severe cropping-resizing bottleneck in the tracking pipeline. In addition, we seek to subject future versions of BOBBY2 to model compression, weight pruning and quantization techniques in order to further improve its speed performance.

References

[1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.

[2] E. T. Bokeno, T. M. Bort, S. S. Burns, M. Rucidlo, W. Wei, D. L. Wires, et al. Package delivery by means of an automated multi-copter uas/uav dispatched from a conventional delivery vehicle, Mar. 13 2018. US Patent 9,915,956.

[3] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550. IEEE, 2010.

[4] G. R. Bradski. Real time face and object tracking as a component of a perceptual user interface. In Proceedings Fourth IEEE Workshop on Applications of Computer Vision, WACV'98 (Cat. No. 98EX201), pages 214–219. IEEE, 1998.


[5] P. Burt, J. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin, and H. Shvayster. Object tracking with a moving camera. In [1989] Proceedings. Workshop on Visual Motion, pages 2–12. IEEE, 1989.

[6] M. Danelljan, G. Hager, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.

[9] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision, pages 702–715. Springer, 2012.

[10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.

[11] L. Huang, X. Zhao, and K. Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.

[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

[14] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. International Journal of Computer Vision, 10(3):257–281, 1993.

[15] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. The sixth visual object tracking VOT2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[17] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.

[18] E. Peeters, E. Teller, W. G. Patrick, and S. Brin. Providing a medical support device via an unmanned aerial vehicle, Feb. 3 2015. US Patent 8,948,935.

[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[21] L. N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018.

[22] L. N. Smith and N. Topin. Super-convergence: Very fast training of residual networks using large learning rates. 2018.

[23] R. S. Stephens. Real-time 3d object tracking. Image and Vision Computing, 8(1):91–96, 1990.

[24] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.

[25] B. Uzkent and Y. Seo. Enkcf: Ensemble of kernelized correlation filters for high-speed object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141. IEEE, 2018.

[26] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2805–2813, 2017.

[27] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.

[28] Y. Wu, Y. Sui, and G. Wang. Vision-based real-time aerial object localization and tracking for uav sensing system. IEEE Access, 5:23969–23978, 2017.

[29] X. Xue, Y. Li, and Q. Shen. Unmanned aerial vehicle object tracking by correlation filter with adaptive appearance model. Sensors, 18(9):2751, 2018.

[30] X. Yang, R. Zhu, J. Wang, and Z. Li. Real-time object tracking via least squares transformation in spatial and fourier domains for unmanned aerial vehicles. Chinese Journal of Aeronautics, 2019.

[31] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.

[32] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2015.

[33] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 101–117, 2018.


A. Appendices

A.1. Appendix A

Figure 7. Training loss of BOBBY2 on the augmented ImageNet-VID dataset.

Figure 8. Validation loss of BOBBY2 on the augmented ImageNet-VID dataset.