
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS

SAT: Single-shot Adversarial Tracker

Qiangqiang Wu, Hanzi Wang, Senior Member, IEEE, Yi Liu, Liming Zhang, Member, IEEE, and Xinbo Gao, Senior Member, IEEE

Abstract—Deep learning-based tracking methods have shown favorable performance on multiple benchmarks. However, most of these methods are not designed for real-time video surveillance systems due to their complex online optimization processes. In this paper, we propose a single-shot adversarial tracker (SAT) to efficiently locate objects of interest in surveillance videos. Specifically, we propose a lightweight convolutional neural network-based generator, which fuses multi-layer feature maps to accurately generate the target probability map (TPM) for tracking. To more effectively train the generator, an adversarial learning framework is presented. During the online tracking stage, the learned TPM generator can be directly employed to generate the target probability map corresponding to the searching region in a single shot. The proposed SAT achieves an average tracking speed of 212 FPS on a single GPU, while still achieving favorable performance on several popular benchmarks. Furthermore, we also present a variant of SAT that considers both scale estimation and online updating, which achieves better accuracy than SAT while still maintaining a very fast tracking speed (i.e., exceeding 100 FPS).

Index Terms—Visual object tracking, generative adversarial network, target probability map.

I. INTRODUCTION

VISUAL object tracking is one of the fundamental and challenging tasks in computer vision. Given the initial state of the target at the first frame, visual object tracking aims to accurately locate the target in subsequent frames of a video. In practice, visual tracking techniques have been widely used in many real-time industrial tasks, such as intelligent transportation systems [1], robotics [2], human-computer interaction [3], vehicle navigation [4], etc. Thus, developing highly efficient and robust visual tracking methods is essential for many industrial applications. In recent years, despite the superior performance achieved by deep learning-based tracking methods on the popular tracking benchmarks [3], [5], most of these methods cannot operate in real time, which severely limits their applications in industry. In this

This work was supported by the National Natural Science Foundation of China under Grants U1605252, 61432014, 61872307, and 61671339, and by the Multi-Year Research Grants of the University of Macau under Grants MYRG2017-00218-FST and MYRG2018-00111-FST. (Corresponding author: Hanzi Wang.)

Q. Wu, H. Wang, and Y. Liu are with the School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: [email protected]; [email protected]; liuyi [email protected]).

L. Zhang is with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, Macau, China (e-mail: [email protected]).

X. Gao is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

paper, we mainly focus on designing a highly efficient visual tracking method, which can be directly employed in practical industrial applications.

Deep convolutional neural networks (CNNs) have been successfully applied to many computer vision tasks, such as object detection [6], semantic segmentation [7], person re-identification [8], and so on. Inspired by the success of CNNs, the computer vision community has exploited the advantages of CNNs to develop deep trackers in recent years. The current deep trackers can be roughly divided into two types: deep feature-based correlation filter (CF) trackers and CNN-based deep trackers. The former uses deep features instead of handcrafted ones to perform correlation filter tracking, while the latter employs end-to-end trained CNNs for online tracking. Note that both CF and CNN-based deep trackers are usually based on discriminative models. Despite the notable performance achieved by these discriminative model-based trackers, they still have several limitations. One of the main limitations is that many of these trackers cannot operate in real time due to the complex optimization process or the time-consuming feature extraction step. For example, the classification-based CNN trackers (e.g., MDNet [9] and SANet [10]) can only run at 1 FPS even on a high-end GPU, though they achieve competitive tracking performance on the benchmarks [3], [5]. To improve the tracking speed of classification-based CNN trackers, several deep Siamese network-based trackers have been proposed. These trackers can achieve above real-time tracking speed, but their tracking performance is not satisfying due to the lack of online updating steps.

The two recently developed regression-based CNN trackers [11], [12] have been reported to significantly improve the computational speed (exceeding 100 FPS) by offline pretraining CNNs on a large-scale training dataset. It is worth pointing out that both trackers perform online tracking without using any online learning strategies. Due to their high efficiency, the two regression-based CNN trackers can fully satisfy the real-time requirement of industrial systems. However, since they lack an online updating step, they are very sensitive to significant target appearance variations, resulting in low tracking accuracy compared to other top-performing online updated trackers (such as MDNet and CCOT). Therefore, although much progress has been made in recent years, it is still challenging to design a highly efficient and robust CNN tracker that can be directly employed in practical industrial applications.

Generative adversarial networks (GANs) [13] have been successfully applied to different computer vision tasks in recent years, such as person re-identification [8], neural style


transfer [14], etc. Although GANs can effectively capture the data manifold and learn discriminative feature representations, they are difficult to apply directly to the visual tracking task. The main challenge is that only a small amount of training data is available during the tracking process, which is not sufficient to train GANs well. Moreover, training GANs in an online manner is very time-consuming, which severely degrades the tracking speed. Recently, several GAN-based trackers [15]–[17] have been proposed. Despite the favorable tracking performance achieved by these trackers on the benchmarks, they can only run at low frame rates. Therefore, how to exploit the generation power of GANs to design a highly efficient tracker is still an open and challenging problem.

Different from prior work that formulates visual tracking as a classification problem, in this paper, we frame visual tracking as a target probability map generation problem. Specifically, we propose a lightweight fully convolutional neural network-based generator, which combines multiple feature maps to predict the target probability map (TPM). In order to effectively train the proposed TPM generator, besides using the Mean Squared Error (MSE) loss to train it in a standard supervised manner, we propose an adversarial learning framework, which trains the proposed TPM generator in an adversarial manner. By playing against the discriminator, our TPM generator tends to generate more accurate target probability maps, and thus the tracking performance obtained by our SAT can be significantly improved. The experimental results demonstrate the effectiveness of the proposed adversarial learning framework for tracking.

Based on the well-learned TPM generator, we present our single-shot adversarial tracker (SAT) for online tracking. Given a target template and a searching region, the proposed TPM generator in SAT can directly predict the target probability map (see Fig. 1) corresponding to the searching region. Based on the distribution of the generated target probability map, the target center location and the target scale can be effectively estimated (see Part D in Section III). It is worth noting that our SAT is significantly faster than previous discriminative model-based CNN trackers (e.g., it is about 200 and 2000 times faster than MDNet and SANet, respectively). Meanwhile, it also achieves favorable performance on popular benchmarks. In addition, we also present a variant of SAT, which achieves better tracking accuracy than SAT while still maintaining a fast tracking speed (i.e., exceeding 100 FPS). The contributions of this work can be summarized as follows:

(1) We propose a lightweight fully convolutional neural network-based target probability map (TPM) generator, which combines multi-layer feature maps with different spatial resolutions to generate the target probability map.

(2) To effectively train the proposed TPM generator, we present an adversarial learning framework. Specifically, a multi-input discriminator is designed to enforce the generator to generate accurate target probability maps. Through the adversarial learning, the tracking performance can be significantly boosted.

(3) Based on the learned TPM generator, we propose a single-shot adversarial tracker (SAT). The proposed SAT can run at an average tracking speed of 212 FPS on a single GPU. In addition, to achieve better tracking performance, we also present a variant of SAT by introducing both scale estimation and online updating to SAT.

Fig. 1. Visualization of the target probability maps generated by the proposed TPM generator on several video sequences. For each video sequence, the target template is shown in the top left corner of the image in the first column of each row. The searching region and the estimated target state in each frame are represented as the blue region and the yellow rectangle, respectively.

Experiments are conducted on five popular datasets, including OTB-50, OTB-2013, OTB-100, VOT-2016 and UAV20L. The results show that the proposed deep generative model-based SAT significantly outperforms the other state-of-the-art generative model-based trackers in terms of both precision and tracking speed. Moreover, the proposed SAT performs favorably against several other state-of-the-art trackers on the UAV20L dataset, which demonstrates that it is well suited for the challenging task of long-term object tracking.

II. RELATED WORK

A. Generative Model-based Tracking

Generative model-based trackers mainly focus on modeling the target appearance. For example, the VTD [18] and VTS [19] trackers use multiple models (i.e., appearance and motion models) to handle appearance variations. Mei et al. [20] employ sparse representations to perform online visual tracking. This technique is widely adopted in many works [21], [22] to facilitate the learning of the target appearance model. In ASLA [23], a structural local sparse appearance model is built to distinguish the target from the background. Zhong et al. [24] propose to use a sparse collaborative model to handle drastic appearance variations. In SMOG [25], a joint spatial-color appearance modeling method is presented for particle filter tracking. However, these generative trackers usually achieve relatively lower accuracy than discriminative model-based trackers. Meanwhile, due to the time-consuming appearance modeling, most of these trackers cannot operate in real time. To improve both the tracking accuracy and speed, we propose a novel deep generative model-based tracker, which can be trained offline to learn better feature representations than previous generative trackers.


B. CNN-based Tracking

CNN-based trackers usually follow a classification or regression schema to perform online tracking. One of the representative classification model-based CNN trackers is MDNet [9], which treats visual tracking as a binary classification problem. Based on MDNet, several improved variants have been proposed, including online regularization [26], reciprocative learning [27], adversarial learning [15], reinforcement learning [28], recurrent learning [10] and meta learning [29]. However, due to the time-consuming online learning steps in these trackers, their tracking speed is still limited. To remove the online learning steps, the recently proposed deep tracker SiamFC [30] learns a deep Siamese network in an offline manner and then applies the well-learned Siamese network for real-time tracking. Starting from SiamFC, several extensions have been developed, including flow correlation tracking [31], residual attention learning [32], bounding box regression [33] and distractor-aware learning [34]. The pioneering work among regression-based trackers is GOTURN [11], which trains a deep regression network to predict the target location in consecutive frames. In Re3 [12], an LSTM network is used for target state prediction. Note that the two regression trackers (GOTURN and Re3) can operate at 165 FPS and 114 FPS on a single GPU, respectively. However, the regression-based trackers still have a large accuracy gap compared with the classification-based trackers, though the former can achieve much faster tracking speed. In this paper, we propose a deep generative model-based regression tracker that not only obtains favorable performance but also achieves faster speed than other state-of-the-art trackers (e.g., GOTURN and SiamFC).

C. Generative Adversarial Learning-based Tracking

Recently, several generative adversarial learning-based trackers have been proposed. In VITAL [15] and SINT++ [17], the authors employ generative adversarial networks (GANs) to generate hard positive samples in order to obtain more robust online CNN classifiers. In [16], an adversarial learning framework is presented to jointly optimize the regression network and the classification network. Wu et al. [35] propose an adversarial hallucination method to facilitate the online classification model learning for visual tracking. However, these trackers cannot be directly applied in industrial tasks due to their low tracking speed. In this paper, to bridge the gap between industry and academia, we propose a GAN-based highly efficient tracker.

III. PROPOSED METHOD

In this section, we first propose a lightweight fully convolutional neural network-based TPM generator, which aggregates multi-layer feature maps with different spatial resolutions to estimate the target probability distribution. Second, we present an adversarial learning framework, where a multi-input discriminator is designed so that the proposed TPM generator can be trained more effectively by playing against the discriminator. Third, based on the learned TPM generator, we propose the single-shot adversarial tracker (SAT) and a variant of SAT, which considers both scale estimation and online updating in SAT, for better accuracy. The flow chart of the proposed method is shown in Fig. 2. The overall steps of the proposed SAT are shown in Algorithm 1. We also list the abbreviations used in this paper in Table I for easy reading.

Fig. 2. The flow chart of the proposed method, which consists of two parts: (1) the training part indicated by the red arrow and (2) the single-shot visual tracking part indicated by the blue arrows.

A. Single-shot TPM Generator

We design a lightweight yet effective fully convolutional neural network-based TPM generator to generate the target probability map in a single shot. The overall architecture of the TPM generator is shown in Fig. 3. The encoder part (shown in the left part of Fig. 3) includes two streams, termed the template stream (top left part) and the searching stream (bottom left part). Both streams have five convolutional layers, and each convolutional layer is followed by a Leaky ReLU (LR) operation and a Batch Normalization (BN) operation. The two streams share the same weights. In order to fuse multi-resolution feature maps extracted from different layers, both the template stream and the searching stream apply max pooling operations to bring the multi-layer feature maps to the same spatial resolution. Note that in the encoder part, strided convolutions are used to perform downsampling instead of pooling operations. The main reason is that pooling operations may lead to blurred target probability maps, while strided convolutions are more effective at generating non-blurred target probability maps.

As shown in Fig. 3, in the decoder part, we first use a 1×1 convolutional kernel to reduce the dimensions of the features obtained from the encoder part. Then, the deconvolution operations [36], which consist of convolution operations with fractional strides, are used to restore the compressed features to the original resolution (128×128) of the encoder part. Similar to the encoder part, each deconvolution operation is followed by an LR operation and a BN operation. All the convolution and deconvolution operations employed in our TPM generator use a kernel of size 4×4 with a stride of 2.
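To make the architecture concrete, the following is a minimal PyTorch sketch of such an encoder-decoder TPM generator. It reflects our reading of Fig. 3 rather than the authors' released code: the exact channel widths, the resolution at which the last three layers are fused, the sigmoid output and all module names are assumptions.

```python
# A minimal sketch of a TPM-generator-style network; hyper-parameters are
# assumptions inferred from Fig. 3, not the authors' implementation.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # 4x4 strided convolution followed by Leaky ReLU and Batch Normalization
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.BatchNorm2d(cout))

def deconv_block(cin, cout):
    # 4x4 fractionally strided (de)convolution followed by LR and BN
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.BatchNorm2d(cout))

class TPMGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: the template and searching streams reuse these layers.
        self.conv1 = conv_block(3, 64)     # 128x128 -> 64x64
        self.conv2 = conv_block(64, 128)   # 64x64  -> 32x32
        self.conv3 = conv_block(128, 256)  # 32x32  -> 16x16
        self.conv4 = conv_block(256, 512)  # 16x16  -> 8x8
        self.conv5 = conv_block(512, 512)  # 8x8    -> 4x4
        self.pool3 = nn.MaxPool2d(4)       # align conv3 features to 4x4
        self.pool4 = nn.MaxPool2d(2)       # align conv4 features to 4x4
        # 1x1 convolution compresses the fused features of both streams.
        self.reduce = nn.Conv2d(2 * (256 + 512 + 512), 512, kernel_size=1)
        # Decoder: restore the 128x128 resolution of the searching region.
        self.decoder = nn.Sequential(
            deconv_block(512, 256),        # 4x4   -> 8x8
            deconv_block(256, 128),        # 8x8   -> 16x16
            deconv_block(128, 64),         # 16x16 -> 32x32
            deconv_block(64, 32),          # 32x32 -> 64x64
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid())                  # 64x64 -> 128x128 probability map

    def encode(self, x):
        f3 = self.conv3(self.conv2(self.conv1(x)))
        f4 = self.conv4(f3)
        f5 = self.conv5(f4)
        # Fuse the last three layers at a common spatial resolution.
        return torch.cat([self.pool3(f3), self.pool4(f4), f5], dim=1)

    def forward(self, template, search):
        fused = torch.cat([self.encode(template), self.encode(search)], dim=1)
        return self.decoder(self.reduce(fused))

# A 128x128 template and searching region yield a 128x128 probability map.
g = TPMGenerator()
print(g(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)).shape)
```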

B. Adversarial Learning Framework

We present an adversarial learning framework to facilitate the learning of the proposed TPM generator. The proposed framework consists of the TPM generator and a designed multi-input discriminator. The multi-input discriminator has 6 convolutional layers, where the first 5 layers have the same architecture as the template or searching stream in the encoder part of the proposed TPM generator. The last layer employs a convolutional operation using a standard kernel of size 4×4. In order to further maintain the balance with the generator, the discriminator also applies the same 4×4 convolutional blocks as the generator. The overall pipeline of the proposed adversarial learning framework is shown in Fig. 4. As can be seen, the inputs of the discriminator come from three information sources: 1) the target template patch, 2) the searching region patch, and 3) the generated or the ground-truth target probability map, which indicates the target center location and the target scale. By playing against the discriminator, the proposed TPM generator learns more discriminative features, which helps to generate more accurate target probability maps in a single shot.

Fig. 3. Overall architecture of the proposed fully convolutional neural network-based TPM generator. Given a target template and a searching region, the proposed TPM generator fuses multi-layer features with different spatial resolutions to effectively estimate the target probability map. (LR: Leaky ReLU; BN: Batch Normalization; all convolutional and deconvolutional filters are of size 4×4 except the 1×1 feature-reduction convolution.)

TABLE I
LIST AND DESCRIPTION OF ABBREVIATIONS.

Abbreviation  Description
SAT           Single-shot Adversarial Tracker
TPM           Target Probability Map
MSE           Mean Squared Error
GAN           Generative Adversarial Network
CNN           Convolutional Neural Network
DPR           Distance Precision Rate at a Distance Threshold of 20 Pixels
OSR           Overlap Success Rate at an Overlap Threshold of 0.5
AUC           Area Under the Curve

During the adversarial learning stage, the proposed TPM generator is trained to generate accurate target probability maps such that the discriminator cannot distinguish the ground-truth target probability map from the generated one, which encourages the TPM generator to fit the ground-truth target probability map. Thus, by applying the adversarial learning, the TPM generator can accurately generate target probability maps for robust visual tracking.
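As a companion to the generator sketch above, the following is a hedged PyTorch sketch of a multi-input discriminator of the kind described here; the channel-wise concatenation of the three inputs, the layer widths and the sigmoid output head are assumptions on our part, not the authors' implementation.

```python
# A minimal sketch of a multi-input discriminator in the spirit of Fig. 4.
import torch
import torch.nn as nn

class MultiInputDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            # 4x4 strided convolution + Leaky ReLU + Batch Normalization
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.BatchNorm2d(cout))
        # Inputs: template (3) + searching region (3) + probability map (1).
        self.features = nn.Sequential(
            block(7, 64),     # 128 -> 64
            block(64, 128),   # 64  -> 32
            block(128, 256),  # 32  -> 16
            block(256, 512),  # 16  -> 8
            block(512, 512))  # 8   -> 4
        # 6th layer: a plain 4x4 convolution producing a real/fake score.
        self.head = nn.Conv2d(512, 1, kernel_size=4)

    def forward(self, template, search, prob_map):
        x = torch.cat([template, search, prob_map], dim=1)
        return torch.sigmoid(self.head(self.features(x)).flatten(1))

d = MultiInputDiscriminator()
score = d(torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128),
          torch.rand(2, 1, 128, 128))
print(score.shape)  # torch.Size([2, 1])
```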

C. Mixture Loss for Training

In order to train the proposed TPM generator more effectively, we jointly use two loss functions: the Mean Squared Error (MSE) loss and the adversarial loss [13]. The MSE loss matches the generated target probability map to the ground-truth target probability map. Let z and s denote a target template patch and a searching region patch, respectively. Then, the MSE loss between a generated target probability map and the ground-truth target probability map can be formulated as:

$\mathcal{L}(r_{m,n}, G_{z,s}) = \left\| r_{m,n} - G_{z,s} \right\|^{2}$,  (1)

where $G_{z,s}$ represents the generated target probability map in terms of both the target template $z$ and the searching region $s$, and $r_{m,n}$ indicates the ground-truth target probability map centered at the target center location $(m, n)$. The value of $r_{m,n}$ at a position $(x, y)$ can be calculated as

$r_{m,n}(x, y) = \exp\!\left(-\dfrac{(2|x-m|/W)^{2} + (2|y-n|/H)^{2}}{2\sigma^{2}}\right)$,

where $x, y \in \{1, \ldots, 128\}$, $W$ and $H$ respectively denote the width and height of the target, and $\sigma$ denotes the linear kernel width.

Fig. 4. Overall architecture of our adversarial learning framework. The generator (encoder-decoder) takes the template and the searching region as inputs and produces a target probability map; the discriminator receives the template, the searching region and either the generated or the ground-truth map, and predicts fake/real.
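As a concrete illustration, the snippet below builds such a ground-truth map for a 128×128 searching region; the function name and the default σ = 0.4 (the value reported later in Section IV-A) are our own choices.

```python
# A small illustration of the ground-truth map r_{m,n}; names are ours.
import numpy as np

def ground_truth_tpm(m, n, W, H, sigma=0.4, size=128):
    """Gaussian-shaped map centered at the target center (m, n);
    W and H are the target width and height."""
    x = np.arange(1, size + 1)[None, :]   # column (x) coordinates
    y = np.arange(1, size + 1)[:, None]   # row (y) coordinates
    d2 = (2.0 * np.abs(x - m) / W) ** 2 + (2.0 * np.abs(y - n) / H) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

r = ground_truth_tpm(m=64, n=64, W=40, H=60)
print(r.shape, float(r.max()))  # (128, 128), peak of 1.0 at the target center
```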

However, solely using the MSE loss to train our TPM generator cannot achieve optimal results. This is because the MSE loss only focuses on minimizing the overall error and does not pay particular attention to estimating the target center location. In order to help the generator learn better feature representations and generate more accurate target probability maps, we propose to use the adversarial loss to train the TPM generator as follows:

$\min_{G} \max_{D} \; \mathbb{E}\left[\log D(r_{m,n}) + \log\left(1 - D(G_{z,s})\right)\right]$,  (2)

where the generator $G$ tries to generate the target probability map $G_{z,s}$, which is close to the ground-truth target probability map $r_{m,n}$, while the discriminator $D$ aims to distinguish $G_{z,s}$ from $r_{m,n}$. By playing against the discriminator, the proposed TPM generator can accurately fit the ground-truth target probability map.


Algorithm 1: The Proposed Tracking Method
Input: The target template patch z1 at the first frame and the offline-learned TPM generator.
Output: The estimated target center position (xt, yt), target width Wt and height Ht at the t-th frame.
1: Initialize: obtain the template stream output p1 by inputting z1 to the TPM generator.
2: for t = 2, 3, ... do
3:   Estimate the target center position (xt, yt) at the t-th frame using Eqn. (4);
4:   Evaluate the target candidates with width c^k Wt-1 and height c^k Ht-1 centered at (xt, yt) using Eqns. (5) and (6);
5:   Estimate the target width Wt and height Ht at the t-th frame using Eqns. (7) and (8);
6:   Update the template stream output pt using Eqn. (9);
7: end for

By combining the two loss functions (1) and (2), the overall loss function of the proposed adversarial learning framework is written as:

$\min_{G} \max_{D} \; \mathbb{E}\left[\mathcal{L}(r_{m,n}, G_{z,s}) + \lambda \log D(r_{m,n}) + \lambda \log\left(1 - D(G_{z,s})\right)\right]$,  (3)

where λ is a parameter that controls the relative importance of the adversarial loss and the MSE loss.
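A hedged sketch of one alternating update with this mixture loss is given below, assuming the generator and discriminator sketched earlier; the update order, the use of separate optimizers and the default λ = 10⁻⁴ (the value reported in Section IV-A) are our assumptions rather than the authors' exact training recipe.

```python
# Sketch of one joint training step with the mixture loss of Eqn. (3).
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, template, search, r_gt, lam=1e-4, eps=1e-8):
    # Discriminator step: push D(r_gt) towards 1 and D(G(z, s)) towards 0.
    fake = G(template, search).detach()
    d_real = D(template, search, r_gt)
    d_fake = D(template, search, fake)
    loss_d = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: MSE term plus the lambda-weighted adversarial term.
    gen = G(template, search)
    loss_g = (F.mse_loss(gen, r_gt)
              + lam * torch.log(1 - D(template, search, gen) + eps).mean())
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```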

D. Single-shot Adversarial Tracker

After training the proposed single-shot TPM generator, we directly apply it to visual tracking. We call this basic tracker the Single-shot Adversarial Tracker (SATbasic). Given a target template patch z and a searching region patch s, SATbasic can effectively generate the target probability map in a single shot. Note that the generated target probability map has the same spatial resolution as the searching region patch s, and the coordinates of the maximum probability score in the target probability map are treated as the target center location. The target center localization can be described as follows:

$(x, y) = \arg\max_{x, y} \; G_{z,s}(x, y)$,  (4)

where $G_{z,s}(x, y)$ returns the probability score at the position $(x, y)$ of the generated target probability map $G_{z,s}$.

Scale estimation. After we obtain the generated target probability map $G^{t}_{z,s}$ at the t-th frame, we can effectively estimate the target scale (i.e., the width $W_t$ and the height $H_t$ of the target) based on $G^{t}_{z,s}$. Let $K$ be the number of scales. For each $k \in \{\lfloor -\frac{K-1}{2} \rfloor, \lfloor -\frac{K-2}{2} \rfloor, \ldots, \lfloor \frac{K-1}{2} \rfloor\}$, we evaluate the candidate target width $c^{k} W_{t-1}$ and height $c^{k} H_{t-1}$ in terms of the generated $G^{t}_{z,s}$, where $c$ denotes the scale factor. The scale estimation process can be formulated as:

$k_W = \arg\max_{k} \left\| r^{t}_{x,y}(c^{k} W_{t-1}, 0) - G^{t}_{z,s}(\bar{x}, y) \right\|^{2}$,  (5)

$k_H = \arg\max_{k} \left\| r^{t}_{x,y}(0, c^{k} H_{t-1}) - G^{t}_{z,s}(x, \bar{y}) \right\|^{2}$,  (6)

TABLE II
THE DPRS, OSRS AND SPEED (I.E., FPS ON A GPU) OBTAINED BY SATbasic AND ITS VARIANTS ON THE OTB-2015 DATASET. THE BEST VALUE IS HIGHLIGHTED IN BOLD.

Method       DPR (%)  OSR (%)  Speed
SATbasic     69.3     58.2     212.2
SATscale     69.3     60.8     130.9
SATupdating  70.1     59.1     172.1
SATall       70.9     62.0     123.6

where $r^{t}_{x,y}(c^{k} W_{t-1}, 0) = \exp\!\left(-\frac{(2|\bar{x}-x|/(c^{k} W_{t-1}))^{2}}{2\sigma^{2}}\right)$ and $r^{t}_{x,y}(0, c^{k} H_{t-1}) = \exp\!\left(-\frac{(2|\bar{y}-y|/(c^{k} H_{t-1}))^{2}}{2\sigma^{2}}\right)$; $G^{t}_{z,s}(\bar{x}, y)$ and $G^{t}_{z,s}(x, \bar{y})$ return the values at the locations $(\bar{x}, y)$ and $(x, \bar{y})$ in the generated target probability map $G^{t}_{z,s}$, respectively. Note that $\bar{x} \in \{\lfloor x - W_{t-1}/4 \rfloor, \ldots, \lfloor x + W_{t-1}/4 \rfloor\}$ and $\bar{y} \in \{\lfloor y - H_{t-1}/4 \rfloor, \ldots, \lfloor y + H_{t-1}/4 \rfloor\}$; thus $r^{t}_{x,y}(\cdot, \cdot)$ and $G^{t}_{z,s}(\cdot, \cdot)$ represent one-dimensional vectors. As a result, the width $W_t$ and the height $H_t$ of the target at the t-th frame can be estimated as:

$W_t = (1 - \eta) W_{t-1} + \eta c^{k_W} W_{t-1}$,  (7)

$H_t = (1 - \eta) H_{t-1} + \eta c^{k_H} H_{t-1}$,  (8)

where $\eta$ is the learning rate.

Template updating. To effectively handle target appearance variations online, a simple yet effective updating strategy is presented. Let $p_{t-1}$ be the template stream output at the (t-1)-th frame; we update the template stream output at the t-th frame by using a moving average with weight $\omega$ as follows:

$p_t = (1 - \omega) p_{t-1} + \omega p_t$.  (9)

The obtained $p_t$ is then used in our TPM generator to perform target state estimation in the following frames.

The overall procedure of the proposed SAT is shown in Algorithm 1. As can be seen, the proposed SAT needs only four steps per frame to perform efficient online tracking, including the target scale estimation and the target template updating steps. Specifically, given a target template and a new frame, the proposed TPM generator can effectively generate the target probability map and estimate the target center position in a single shot. Based on the generated target probability map, the new target scale and the updated target template can be efficiently obtained via the last three steps shown in Algorithm 1. Therefore, our SAT runs completely feed-forward and needs only one single-shot generation to estimate the target state in each frame, thus achieving above real-time performance.
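For illustration, the sketch below combines the per-frame state update of Eqns. (4) and (7)-(9), assuming the scale indices k_W and k_H have already been selected by Eqns. (5)-(6); the scale factor c = 1.02 and the way the new template-stream output p_new is obtained are our assumptions, while η = 0.25 and ω = 5×10⁻³ follow the settings reported in Section IV-A.

```python
# A simplified per-frame state update for Algorithm 1; not the authors' code.
import numpy as np

def update_state(G, p_prev, p_new, W_prev, H_prev, k_W, k_H,
                 c=1.02, eta=0.25, omega=5e-3):
    """G is the generated 2-D target probability map; p_prev/p_new are the
    previous and freshly computed template-stream outputs; k_W, k_H are the
    scale indices chosen by Eqns. (5)-(6)."""
    y, x = np.unravel_index(np.argmax(G), G.shape)      # Eqn. (4): target center
    W = (1 - eta) * W_prev + eta * (c ** k_W) * W_prev  # Eqn. (7): width update
    H = (1 - eta) * H_prev + eta * (c ** k_H) * H_prev  # Eqn. (8): height update
    p = (1 - omega) * p_prev + omega * p_new            # Eqn. (9): template update
    return (x, y), W, H, p

# Example with a dummy probability map and dummy template-stream features.
G = np.random.rand(128, 128)
center, W, H, p = update_state(G, np.zeros(512), np.ones(512), 40.0, 60.0, 1, -1)
print(center, round(W, 2), round(H, 2))
```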

IV. EXPERIMENTS

A. Experimental Setting

Evaluation metrics. We adopt the distance precision (DP) plot and the overlap success (OS) plot as the evaluation metrics. Given the ground-truth target center location $(o^{g}_{1}, o^{g}_{2})$ and the estimated target center location $(o^{e}_{1}, o^{e}_{2})$, their Euclidean distance is calculated as $d = \sqrt{(o^{g}_{1} - o^{e}_{1})^{2} + (o^{g}_{2} - o^{e}_{2})^{2}}$. To measure the overall precision performance of the evaluated trackers, the DP plot depicts the percentage of successfully tracked frames, in which the Euclidean distance between the ground-truth target center and


Fig. 5. Training (left) and validation (right) mean error curves (mean error vs. training epoch) obtained by the generator with different feature combinations, including generator@last, generator@last_two and generator@last_three.

Fig. 6. DP plots obtained by SAT and SAT-MSE on the OTB-50 and OTB-2015 datasets using the OPE metric. DPRs are illustrated in brackets: on OTB-50, SATbasic [0.618] vs. SAT-MSE [0.554]; on OTB-2015, SATbasic [0.693] vs. SAT-MSE [0.653].

the estimated target center is within a distance threshold. The OS plot shows the percentage of frames in which the overlap ratio between the ground-truth target bounding box and the estimated target bounding box is higher than a threshold. We use the open source code provided by the OTB committee [3] to generate both the DP and OS plots for all the trackers. Let $v_g$ and $v_e$ be the ground-truth bounding box and the estimated bounding box, respectively. The overlap ratio $u$ is defined as $u = \frac{|v_g \cap v_e|}{|v_g \cup v_e|}$, where $\cap$ and $\cup$ respectively denote the intersection and union operations of two bounding boxes. We report the DP rates at a threshold of 20 pixels (DPR), the OS rates at an overlap ratio (i.e., $u$) threshold of 0.5 (OSR), and the Area Under the Curve (AUC) of each tracker's OS plot, as done in [3], [5]. For VOT-2016, the expected average overlap (EAO), accuracy and failures are adopted to quantitatively evaluate trackers. Specifically, we use the toolkit provided by the VOT committee [37] to evaluate each tracker.
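As a small worked example of these metrics, the snippet below computes the center-location error and the overlap ratio u for axis-aligned boxes, then the DPR/OSR-style frame fractions; the helper functions and toy numbers are ours, not part of the OTB toolkit.

```python
# Toy computation of the center-location error, the overlap ratio u, and the
# DPR/OSR-style frame fractions.
import numpy as np

def center_error(c_gt, c_est):
    # Euclidean distance d between ground-truth and estimated target centers.
    return float(np.hypot(c_gt[0] - c_est[0], c_gt[1] - c_est[1]))

def overlap_ratio(b_gt, b_est):
    # Boxes as (x, y, w, h); u = |intersection| / |union|.
    x1, y1 = max(b_gt[0], b_est[0]), max(b_gt[1], b_est[1])
    x2 = min(b_gt[0] + b_gt[2], b_est[0] + b_est[2])
    y2 = min(b_gt[1] + b_gt[3], b_est[1] + b_est[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b_gt[2] * b_gt[3] + b_est[2] * b_est[3] - inter
    return inter / union if union > 0 else 0.0

errors = np.array([center_error((50, 50), (55, 53)),
                   center_error((50, 50), (90, 80))])
overlaps = np.array([overlap_ratio((10, 10, 40, 40), (15, 12, 40, 40)),
                     overlap_ratio((10, 10, 40, 40), (60, 60, 40, 40))])
print((errors <= 20).mean())   # DPR at a 20-pixel threshold: 0.5
print((overlaps > 0.5).mean()) # OSR at an overlap threshold of 0.5: 0.5
```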

Implementation details. Since the discriminator is much easier to train than the generator in an adversarial learning framework, directly training a randomly initialized TPM generator together with the discriminator may make the discriminator too powerful, such that the TPM generator suffers from vanishing gradients and learns nothing. To stabilize the training of the generator, we split the training into two stages. First, we train the generator with the MSE loss in Eqn. (1) for 3 epochs with a batch size of 256. Second, the proposed TPM generator and the discriminator are jointly trained with the mixture loss in Eqn. (3) for 10 epochs with a batch size of 256. The data used in the two-stage training is collected from three datasets: ILSVRC-2015 [38], ALOV300++ [39], and NUS-Pro [40]. ILSVRC-2015 contains 4417 videos of 30 different object categories; we use 3862 of these videos for training and the remaining 555 videos for validation. The ALOV300++ and NUS-Pro datasets

Fig. 7. Parameter sensitivity analysis of η in Eqn. (7) and ω in Eqn. (9) on the OTB-2015 dataset. DPRs are illustrated in brackets: for η, 2.5×10⁻¹ [0.709], 1.5×10⁻¹ [0.694], 2.0×10⁻¹ [0.693], 3.5×10⁻¹ [0.691] and 3.0×10⁻¹ [0.681]; for ω, 5×10⁻³ [0.709], 2.5×10⁻³ [0.679], 7.5×10⁻³ [0.667], 1×10⁻² [0.658] and 1.25×10⁻² [0.648].

Fig. 8. OS plots obtained by SATall and 9 competing trackers on the OTB-50 and OTB-2015 datasets. AUCs and speed are illustrated in brackets. On OTB-50: SATall-123.6fps [0.478], DSST-28.3fps [0.470], KCF-172.0fps [0.435], GOTURN-165.0fps [0.410], ASLA-8.5fps [0.383], LSK-5.5fps [0.375], SCM-0.5fps [0.372], VTD-5.7fps [0.317], VTS-5.7fps [0.306], Re3-113.7fps [0.262]. On OTB-2015: DSST-28.3fps [0.513], SATall-123.6fps [0.512], KCF-172.0fps [0.475], GOTURN-165.0fps [0.427], SCM-0.5fps [0.417], ASLA-8.5fps [0.409], LSK-5.5fps [0.385], VTD-5.7fps [0.367], VTS-5.7fps [0.362], Re3-113.7fps [0.289].

contain 315 and 365 videos, respectively. For fair comparison, we remove all the videos that also appear in the test data. We use the Adam solver [41] for optimization with a learning rate of 2×10⁻⁴. For each training video, we randomly select a batch of frames as searching frames, which are used to crop the searching region patches. The same number of target template patches is obtained by randomly selecting from the remaining frames of the training video. Additionally, in the training part, the linear kernel width σ of the ground-truth map and the hyper-parameter λ in Eqn. (3) are set to 4×10⁻¹ and 10⁻⁴, respectively. In the tracking part, we set the scale number K, the learning rate η in Eqn. (7) and the moving average ω in Eqn. (9) to 33, 2.5×10⁻¹, and 5×10⁻³, respectively. The proposed SAT is implemented using PyTorch [42] on a PC with an Intel 6700K 4.0 GHz CPU and an NVIDIA GTX 1080 GPU.
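The two-stage schedule described above can be sketched as follows; the data-loading details are elided, the joint update is passed in as a callable (for instance the train_step sketched in Section III-C), and all names are illustrative rather than the authors' code.

```python
# A hedged sketch of the two-stage training schedule: MSE-only warm-up for
# 3 epochs, then joint adversarial training with the mixture loss for 10 epochs.
import torch
import torch.nn.functional as F

def fit(G, D, loader, joint_step, warmup_epochs=3, joint_epochs=10,
        lr=2e-4, lam=1e-4):
    """loader yields (template, search, r_gt) batches; joint_step performs one
    generator/discriminator update with the mixture loss of Eqn. (3)."""
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    # Stage 1: supervised warm-up keeps the generator from being overwhelmed
    # by an initially too-strong discriminator.
    for _ in range(warmup_epochs):
        for template, search, r_gt in loader:
            loss = F.mse_loss(G(template, search), r_gt)
            opt_g.zero_grad(); loss.backward(); opt_g.step()
    # Stage 2: alternating adversarial training with the mixture loss.
    for _ in range(joint_epochs):
        for template, search, r_gt in loader:
            joint_step(G, D, opt_g, opt_d, template, search, r_gt, lam=lam)
```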

B. Ablation Studies

Evaluation on the variants of SATbasic. We present the following variants of SATbasic (i.e., SATscale, SATupdating and SATall) to analyze the influence of each component of SATbasic, and we show the tracking performance obtained by each variant on the OTB-2015 dataset. SATbasic and its variants are listed as follows:
• SATbasic: tracking without using either the scale estimation strategy or the target template updating strategy.
• SATscale: tracking by only using the scale estimation strategy.
• SATupdating: tracking by only using the online updating strategy.
• SATall: tracking by using both the scale estimation and online updating strategies.

Table II shows the comparison between SATbasic and its variants. As can be seen, SATscale achieves 60.8% OSR, outperforming SATbasic by a margin of 2.6%. This shows


TABLE III
COMPARISONS WITH 5 TOP-PERFORMING GENERATIVE MODEL-BASED TRACKERS AND 4 REGRESSION TRACKERS ON THE OTB-50, OTB-2013 AND OTB-2015 DATASETS USING THE OPE METRIC. THE LAST ROW SHOWS THE AVERAGE TRACKING SPEED OBTAINED BY EACH TRACKER (* INDICATES GPU SPEED, OTHERWISE CPU SPEED). THE CNN-BASED TRACKERS ARE REPORTED WITH THE GPU SPEED, WHILE THE OTHERS ARE REPORTED WITH THE CPU SPEED.

Dataset      Metric    SCM   ASLA  VTD   VTS   LSK   GOTURN  DSST  Re3     KCF    SATbasic  SATall
OTB-50       DPR (%)   44.1  43.1  40.8  39.6  38.2  53.7    60.4  38.4    61.1   61.8      64.2
OTB-50       OSR (%)   37.0  38.8  33.0  32.0  32.4  42.9    53.8  24.5    45.5   50.2      54.3
OTB-2013     DPR (%)   59.7  53.2  57.6  57.5  50.5  62.0    74.0  46.0    74.1   72.4      75.4
OTB-2013     OSR (%)   55.7  51.1  49.3  49.6  45.6  53.2    67.0  30.5    62.2   64.2      67.9
OTB-2015     DPR (%)   53.6  51.4  51.1  50.4  49.5  57.2    68.0  40.6    69.2   69.3      70.9
OTB-2015     OSR (%)   47.8  47.3  40.4  39.6  44.3  48.6    60.1  25.1    54.8   50.0      62.0
Speed (FPS)            0.5   8.5   5.7   5.7   5.5   165.0*  28.3  113.7*  172.0  212.2*    123.6*

(SCM, ASLA, VTD, VTS and LSK are generative model-based trackers; GOTURN, DSST, Re3 and KCF are regression trackers.)

TABLE IV
THE DP RATES AT A THRESHOLD OF 20 PIXELS OBTAINED BY THE 9 COMPETING TRACKERS AND THE PROPOSED SATbasic AND SATall ON THE OTB-2015 DATASET OVER DIFFERENT ATTRIBUTES USING THE OPE METRIC. THE FIRST AND SECOND BEST RESULTS ARE SHOWN IN COLOR.

Attribute               SCM   VTD   VTS   LSK   GOTURN  DSST  Re3   KCF   ASLA  SATbasic  SATall
Fast Motion             30.5  32.8  32.9  44.7  46.5    55.2  34.7  62.0  24.9  61.9      67.0
Motion Blur             26.2  28.6  27.3  40.8  39.9    56.7  35.2  60.0  24.6  63.2      68.7
Illumination Variation  51.9  50.4  50.2  46.8  56.7    72.1  36.7  70.8  54.5  63.6      65.3
Scale Variation         53.5  52.0  51.1  47.1  58.0    63.8  39.4  63.3  52.4  67.8      68.3
Occlusion               51.4  48.4  45.6  47.5  48.6    59.7  30.3  62.2  47.2  63.6      65.7
Deformation             47.9  46.1  46.2  43.6  57.5    54.2  34.1  61.7  47.7  67.8      68.0
Out-of-plane Rotation   54.8  57.4  55.8  49.6  59.3    64.4  37.8  67.0  52.4  73.2      73.9
In-plane Rotation       51.3  56.1  54.2  50.1  55.7    69.1  40.2  69.3  50.6  75.6      77.5
Background Clutter      57.5  53.9  52.5  44.9  56.2    70.4  42.7  71.2  52.0  62.1      62.0
Low Resolution          78.4  53.7  50.9  50.0  80.1    64.9  60.5  67.1  62.7  80.6      80.4
Out of View             38.5  40.3  38.8  40.3  44.2    48.1  31.4  49.8  34.8  57.3      60.4

the effectiveness of the proposed scale estimation strategy. Additionally, when we update the target template online, SATupdating can better handle target appearance variations and achieves a better DPR than SATbasic. Moreover, when applying both strategies, SATall achieves the best tracking performance while still maintaining a real-time tracking speed of up to 123.6 FPS. In the following experiments, we select SATall as the variant of SATbasic for comparison.

Generator architecture exploration. In Fig. 5, we plot the training and validation curves obtained by the TPM generator with different feature combinations, including generator@last, generator@last_two and generator@last_three, which denote the feature combinations of the last layer, the last two layers and the last three layers, respectively. As can be seen, combining multi-layer features is helpful for achieving a better convergence rate and accuracy, especially for the generator (i.e., generator@last_three) that combines the features extracted from all of the last three layers. This is because the features used in generator@last_three provide richer information, incorporating both mid-level and high-level cues, thus leading to more accurate estimation. Therefore, in this paper, we choose generator@last_three (see Fig. 3 for details) as our TPM generator.

Non-adversarial training. To assess the influence of the adversarial training on the performance of SATbasic, we train the TPM generator using only the MSE loss in Eqn. (1) (denoted SAT-MSE) with the same training data used by SATbasic. We evaluate SATbasic and SAT-MSE on OTB-50 and OTB-2015 to show the gain of the adversarial training. Fig. 6 shows the DP rates at the threshold of 20 pixels obtained by SATbasic and SAT-MSE. As we can see, the DPR obtained by SATbasic is higher than that obtained by SAT-MSE, which shows that the adversarial training helps SATbasic learn more discriminative features for visual tracking.

Parameter sensitivity analysis. We perform a parameter sensitivity analysis to investigate how the two parameters used in the proposed SAT (i.e., the learning rate η in Eqn. (7) and the moving average ω in Eqn. (9)) contribute to the final tracking performance. Note that we use SATall as our baseline tracker for evaluation, and when we vary η or ω, the other parameters set in Part A of Section IV are fixed. As can be seen in Fig. 7, setting appropriate values for η and ω effectively facilitates the online adaptation to target appearance variations, such that the proposed SATall can achieve better tracking results. Specifically, our SATall achieves the top-performing results when η = 2.5×10⁻¹ and ω = 5×10⁻³. In the following experiments, we fix these parameter settings for further evaluation.

C. Evaluation on the OTB Datasets

1) Comparison with generative and regression trackers. Since the proposed SAT is based on a deep generative model and regresses target probability maps to perform visual tracking, we select two groups of trackers to compare with the proposed SATbasic and its variant SATall. The first group contains the top 5 performing generative model-based trackers of the 18 generative trackers in [3], including SCM [24], ASLA [23], VTD [18], VTS [19] and LSK [43]. The codes of these generative model-based trackers are open source and can be found in the code library provided by the OTB benchmark [3]. Furthermore, the second group contains 4


Fig. 9. The speed-accuracy results (DPR (%) vs. speed (FPS)) obtained by the proposed SATall and SATbasic and the other 6 state-of-the-art real-time trackers on the OTB-2013 dataset. The speed of each tracker is shown in brackets: TADT (33.7 FPS), DaSiamRPN (115.6 FPS), SiamRPN (160.0 FPS), CFNet (67.0 FPS), UDT (70.0 FPS), SiamFC (86.0 FPS), SATall (123.6 FPS), SATbasic (212.2 FPS).

state-of-the-art regression trackers, which are closely related to our tracker and can run at real-time speed (FPS > 25), including two CNN-based trackers (i.e., GOTURN [11] and Re3 [12]) and two correlation filter-based trackers (i.e., KCF [44] and DSST [45]). Note that the source codes of all the above-mentioned trackers are provided by their original papers, and we use their official implementations for fair comparison. Among these trackers, the ones most closely related to the proposed tracker are Re3 and GOTURN. Although some discriminative trackers (such as MDNet [9] and SANet [10]) can achieve promising performance, we do not compare SAT with them due to their low speed (around 1 FPS).

Overall performance. Fig. 8 shows the comparison of the proposed SATall with the other 9 competing trackers. As can be seen, the proposed SATall achieves the best AUC (47.8%) among all the compared trackers on the OTB-50 dataset, followed by DSST (47.0%), KCF (43.5%) and TLD (41.3%), which demonstrates the overall effectiveness of the proposed tracker. For the OTB-100 dataset, our SATall obtains an AUC (51.2%) comparable to that of DSST (51.3%), meanwhile achieving an above real-time tracking speed of 123.6 FPS, which is almost 4.4 times faster than DSST. We report the OPE results obtained by the competing trackers on the OTB datasets in Table III. Compared with the five generative model-based trackers in Table III, the proposed SATbasic achieves the best results on both the DPR and OSR metrics on the three datasets, and its speed is up to 212.2 FPS. More specifically, on the OTB-2015 dataset, the DPR obtained by SATbasic is 69.3%, which outperforms the second best generative model-based tracker (i.e., SCM) by a large margin of 15.7%. In the meantime, SATbasic is about 424.4 times faster than SCM. The GOTURN and Re3 trackers, which are the two CNN-based regression trackers, use similar training data to ours. However, our SATbasic significantly outperforms the two trackers on both the OSR and DPR metrics, which is mainly due to the better feature representations learned by the proposed method.

Attribute analysis. The 100 video sequences in OTB-2015 are annotated with 11 attributes. We evaluate each tracker using the video sequences with different attributes. Table IV shows the DPRs obtained by the 11 competing trackers on the video sequences with the 11 different attributes using the OPE metric. The proposed SATall achieves the best results on the videos annotated with 8 out of the 11 attributes. Specifically, for the video sequences with the Occlusion and Out of View attributes, SATall still achieves the best performance, which shows its ability for real-time long-term tracking. Additionally, the DPRs obtained by SATall and SATbasic for the videos with the

TABLE V
THE EAO, ACCURACY AND FAILURES OBTAINED BY THE PROPOSED SATall AND 5 STATE-OF-THE-ART TRACKERS ON THE VOT-2016 DATASET. THE BEST AND SECOND BEST RESULTS ARE HIGHLIGHTED IN RED BOLD AND BLUE, RESPECTIVELY.

Metric    SATall  Staple  MDNet  SiamFC  HCF    KCF
EAO       0.27    0.30    0.26   0.24    0.22   0.19
Accuracy  0.55    0.55    0.54   0.53    0.45   0.49
Failures  22.64   23.90   21.08  29.80   23.86  37.0

Illumination Variation attribute are lower than those obtained by DSST. The main reason is that DSST uses HOG features, which are more robust to illumination variations.

2) Comparison with state-of-the-art real-time trackers. We evaluate the proposed SATbasic and SATall in comparison with 6 state-of-the-art real-time trackers, including TADT [46], UDT [47], DaSiamRPN [34], SiamRPN [33], SiamFC [30] and CFNet [48]. As shown in Fig. 9, although the proposed generative trackers achieve lower tracking accuracy than the 6 discriminative trackers, the speed of the proposed trackers is very promising. In particular, the speed of SATbasic is 212.2 FPS, which is about 3.0 and 6.3 times faster than UDT and TADT, respectively. The high speed of SATbasic can be attributed to three aspects: (1) the fully convolutional network in SATbasic has a small number of parameters; (2) the TPM generator in SATbasic can generate the target probability map corresponding to the searching region in a single shot, by which the target state can be efficiently estimated; and (3) SATbasic avoids the time-consuming online fine-tuning procedure.

D. Evaluation on the VOT-2016 Dataset

In this subsection, we evaluate the proposed SATall on the VOT-2016 [37] dataset in comparison with 5 state-of-the-art trackers, including MDNet [9], SiamFC [30], HCF [49], Staple [50] and KCF [44]. For all the compared trackers, we use the open source codes provided by the authors and follow the VOT protocol [37] for fair evaluation. The tracking performance obtained by each tracker is shown in Table V. As can be seen, the proposed SATall achieves an EAO (0.27) comparable to Staple (0.30), meanwhile significantly outperforming SiamFC (0.24), HCF (0.22) and KCF (0.19) with relative gains of 12.5%, 22.7% and 42.1%, respectively. In terms of accuracy, our SATall achieves the best result (0.55), which outperforms the classification-based discriminative tracker MDNet and demonstrates the superiority of the proposed deep generative tracker. Furthermore, the proposed SATall achieves a favorable failure score of 22.64, which is only inferior to MDNet and outperforms SiamFC and HCF. Overall, the proposed SATall achieves favorable performance on VOT-2016.

E. Evaluation on the UAV20L Dataset

In practical video surveillance tasks, there are usually challenging long video sequences to handle. Thus, developing a robust tracker that can effectively handle the challenges (e.g., occlusion and out of view) in long video sequences is essential. To evaluate the long-term tracking ability of SATall, we conduct the following experiment on the UAV20L dataset [51], which contains 20 challenging long video sequences.


Fig. 10. The speed-accuracy results (DPR (%) vs. speed (FPS)) obtained by our proposed SATall tracker and the other 10 competing trackers on the UAV20L dataset. The DPR obtained by each tracker is shown in brackets: Ours (57.2), MDNet (57.0), KCFDPT (51.9), MUSTer (51.7), SRDCF (50.7), HCF (49.0), Staple (48.5), MEEM (48.2), HDT (45.5), Re3 (41.7), KCF (31.1).

We compare our SATall with 10 state-of-the-art trackers, including MDNet [9], HCF [49], SRDCF [52], Staple [50], Re3 [12], MUSTer [53], MEEM [54], KCF [44], HDT [55] and KCFDPT [56]. Note that we use their codes and the default hyper-parameters in their original papers for fair comparison.

The speed-accuracy result obtained by each tracker is shown in Fig. 10. As can be seen, the proposed SATall not only achieves a beyond real-time tracking speed of 60.3 FPS (the speed degradation is mainly due to the high resolution of the images in UAV20L), but also obtains the best DPR (57.2%) on UAV20L, followed by MDNet (57.0%), KCFDPT (51.9%) and MUSTer (51.7%). These comparison results experimentally demonstrate the superiority of the proposed tracker. Compared with the state-of-the-art long-term tracker KCFDPT, SATall achieves a relative DPR gain of 10.2%, which shows the better long-term tracking ability of the proposed SATall.

V. CONCLUSION

In this paper, we propose a single-shot adversarial tracker (SAT) for efficient visual object tracking. Different from recent popular discriminative model-based trackers that formulate visual tracking as a classification task, the proposed SAT formulates visual tracking as a target probability map generation task and performs per-frame visual tracking efficiently. More specifically, we first design a lightweight convolutional neural network-based TPM generator, which incorporates multi-layer features to generate the target probability map. Next, an adversarial learning framework is presented to effectively train the proposed TPM generator. Based on the learned TPM generator, our SAT is proposed, which can run at an average tracking speed of 212 FPS on a single GPU while still achieving favorable performance. To further handle more complicated cases, we present a variant of SAT that uses both scale estimation and online updating strategies, which achieves better tracking accuracy while still maintaining a fast tracking speed (more than 100 FPS). Extensive experiments on five popular benchmarks show the great potential of the proposed SAT for practical industrial applications.

REFERENCES

[1] A. M. Bencloucif, A. Nguyen, C. Sentouh, and J. Popieul, “Cooper-ative trajectory planning for haptic shared control between driver andautomation in highway driving,” IEEE Trans. Ind. Electron., 2019.

[2] W.-H. Huang, D. J. Ballance, and P. J. Gawthrop, “A nonlinear distur-bance observer for robotic manipulators,” IEEE Trans. Ind. Electron.,vol. 47, no. 4, pp. 932–938, 2000.

[3] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEETrans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, 2015.

[4] W. Chung and S. Kim, “Safe navigation of a mobile robot consideringvisibility of environment,” IEEE Trans. Ind. Electron., vol. 56, no. 10,pp. 3941–3950, 2009.

[5] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,”in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2411–2418,2013.

[6] R. Girshick, “Fast r-cnn,” in Proc. IEEE Int. Conf. Comput. Vis., pp.1440–1448, 2015.

[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 39, no. 4, pp. 640–651, 2017.

[8] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by ganimprove the person re-identification baseline in vitro,” in Proc. IEEE Int.Conf. Comput. Vis., pp. 3774–3782, 2017.

[9] H. Nam and B. Han, “Learning multi-domain convolutional neuralnetworks for visual tracking,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit., pp. 4293–4302, 2016.

[10] H. Fan and H. Ling, “Sanet: Structure-aware network for visual track-ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, pp.2217–2224, 2017.

[11] D. Held, S. Thrun, and S. Savaresei, “Learning to track at 100 fps withdeep regression networks,” in Proc. Eur. Conf. Comput. Vis., pp. 749–765, 2016.

[12] D. Gorden, A. Farhadi, and D. Fox, “Re3: Real-time recurrent regressionnetworks for object tracking,” IEEE Robo. and Auto. Lett., vol. 3, no. 2,pp. 788–795, 2017.

[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., pp. 2672–2680, 2014.

[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 2242–2251, 2017.

[15] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang, “Vital: Visual tracking via adversarial learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8990–8999, 2018.

[16] F. Zhao, J. Wang, Y. Wu, and M. Tang, “Adversarial deep tracking,” IEEE Trans. Circuits Syst. Video Technol., pp. 788–795, 2018.

[17] X. Wang, C. Li, B. Luo, and J. Tang, “Sint++: Robust visual tracking via adversarial positive instance generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4864–4873, 2018.

[18] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1269–1276, 2010.

[19] J. Kwon and K. M. Lee, “Tracking by sampling trackers,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 1195–1202, 2011.

[20] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, 2011.

[21] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2042–2049, 2012.

[22] Z. Hong, X. Mei, D. Prokhorov, and D. Tao, “Tracking via robust multi-task multi-view joint sparse representation,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 649–656, 2013.

[23] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1822–1829, 2012.

[24] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparse collaborative appearance model,” IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, 2014.

[25] H. Wang, D. Suter, and K. Schindler, “Effective appearance model and similarity measure for particle filtering and visual tracking,” in Proc. Eur. Conf. Comput. Vis., pp. 606–618, 2006.

[26] B. Han, H. Adam, and J. Sim, “Branchout: Regularization for online ensemble tracking with cnns,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3356–3365, 2017.

[27] S. Pu, Y. Song, C. Ma, H. Zhang, and M.-H. Yang, “Deep attentive tracking via reciprocative learning,” in Proc. Adv. Neural Inf. Process. Syst., pp. 1935–1945, 2018.

[28] S. Yun, J. Choi, and Y. Yoo, “Action-decision networks for visual tracking with deep reinforcement learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2711–2720, 2017.

[29] E. Park and A. C. Berg, “Meta-tracker: Fast and robust online adaptation for visual object trackers,” in Proc. Eur. Conf. Comput. Vis., pp. 569–585, 2018.

[30] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. Eur. Conf. Comput. Vis. Workshop, pp. 850–865, 2016.


[31] Z. Zhu, W. Wu, and W. Zou, “End-to-end flow correlation tracking with spatial-temporal attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 548–557, 2018.

[32] Q. Wang, Z. Teng, and J. Xing, “Learning attentions: residual attentional siamese network for high performance online visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4854–4863, 2018.

[33] B. Li, W. Wu, Z. Zhu, and J. Yan, “High performance visual tracking with siamese region proposal network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8971–8980, 2018.

[34] Z. Zhu, Q. Wang, and B. Li, “Distractor-aware siamese networks for visual object tracking,” in Proc. Eur. Conf. Comput. Vis., pp. 101–117, 2018.

[35] Q. Wu, Z. Chen, and L. Cheng, “Hallucinated adversarial learning for robust visual tracking,” arXiv preprint arXiv:1906.07008, 2019.

[36] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 1520–1528, 2015.

[37] M. Kristan, A. Leonardis, and J. Matas, “The visual object tracking vot2016 challenge results,” in Proc. IEEE Int. Conf. Comput. Vis. Workshop, 2016.

[38] O. Russakovsky, J. Deng, et al., “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.

[39] A. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, 2014.

[40] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan, “Nus-pro: A new visual tracking challenge,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 335–349, 2015.

[41] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, 2014.

[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2017.

[43] B. Liu, J. Huang, L. Yang, and C. Kulikowski, “Robust tracking using local sparse appearance model and k-selection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1313–1320, 2011.

[44] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.

[45] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Discriminative scale space tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561–1575, 2017.

[46] X. Li, C. Ma, and B. Wu, “Target-aware deep tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1369–1378, 2019.

[47] N. Wang, Y. Song, C. Ma, W. Zhou, and W. Liu, “Unsupervised deep tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1308–1317, 2019.

[48] J. Valmadre, L. Bertinetto, and J. Henriques, “End-to-end representation learning for correlation filter based tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2805–2813, 2017.

[49] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 3074–3082, 2015.

[50] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr, “Staple: Complementary learners for real-time tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1401–1409, 2016.

[51] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for uav tracking,” in Proc. Eur. Conf. Comput. Vis., pp. 445–461, 2016.

[52] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis., pp. 4310–4318, 2015.

[53] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 749–758, 2015.

[54] J. Zhang, S. Ma, and S. Sclaroff, “Meem: Robust tracking via multiple experts using entropy minimization,” in Proc. Eur. Conf. Comput. Vis., pp. 188–203, 2014.

[55] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang, “Hedged deep tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4303–4311, 2016.

[56] D. Huang, L. Luo, and Z. Chen, “Applying detection proposals to visual tracking for scale and aspect ratio adaptability,” Int. J. Comput. Vis., vol. 122, no. 3, pp. 524–541, 2017.

Qiangqiang Wu received the BS degree from the School of Information and Electronic Engineering, Zhejiang Gongshang University in 2016, and the MS degree in Computer Science from Xiamen University in 2019. He is working toward the Ph.D. degree with the Department of Computer Science in City University of Hong Kong, Hong Kong, China. His research interests include computer vision and machine learning.

Hanzi Wang received the Ph.D. (Hons.) degree in Computer Vision from Monash University. He is currently a Distinguished Professor of "Minjiang Scholars" in Fujian province and a Founding Director of the Center for Pattern Analysis and Machine Intelligence (CPAMI) at Xiamen University in China. His research interests are concentrated on computer vision and pattern recognition including visual tracking, robust statistics, object detection, video segmentation, model fitting, optical flow calculation, 3D structure from motion, image segmentation and related fields.

Yi Liu received the BS degree from the School of Automation, Beijing Institute of Technology in 2011, and the MS degree from the School of Aeronautic Science and Engineering, Beihang University in 2015. He is currently pursuing the Ph.D. degree with the School of Informatics, Xiamen University, Xiamen, China. His current research interests include pattern recognition, metric learning and visual tracking.

Liming Zhang received the B.S. degree in computer software from Nankai University, China, the M.S. degree in signal processing from the Nanjing University of Science and Technology, China, and the Ph.D. degree in image processing from the University of New England, Australia. She is currently an Assistant Professor with the Faculty of Science and Technology, University of Macau. Her research interests include signal processing, image processing, computer vision, and IT in education.

Xinbo Gao (M'02-SM'07) received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 1994, 1997, and 1999, respectively. He is currently a Cheung Kong Professor with the Ministry of Education, a Professor of pattern recognition and intelligent system, and the Director of the State Key Laboratory of Integrated Services Networks, Xidian University. His current research interests include multimedia analysis, computer vision, pattern recognition, machine learning, and wireless communications.