
Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking

Heng Fan    Haibin Ling
Department of Computer and Information Sciences, Temple University
{hengfan,hbling}@temple.edu

Abstract

Region proposal networks (RPN) have recently been combined with the Siamese network for tracking, and have shown excellent accuracy with high efficiency. Nevertheless, previously proposed one-stage Siamese-RPN trackers degenerate in the presence of similar distractors and large scale variation. Addressing these issues, we propose a multi-stage tracking framework, Siamese Cascaded RPN (C-RPN), which consists of a sequence of RPNs cascaded from deep high-level to shallow low-level layers in a Siamese network. Compared to previous solutions, C-RPN has several advantages: (1) Each RPN is trained using the outputs of the RPN in the previous stage. This process simulates hard negative sampling, resulting in more balanced training samples. Consequently, the RPNs are sequentially more discriminative in distinguishing difficult background (i.e., similar distractors). (2) Multi-level features are fully leveraged through a novel feature transfer block (FTB) for each RPN, further improving the discriminability of C-RPN using both high-level semantic and low-level spatial information. (3) With multiple steps of regression, C-RPN progressively refines the location and shape of the target in each RPN with anchor boxes adjusted in the previous stage, which makes localization more accurate. C-RPN is trained end-to-end with a multi-task loss function. During inference, C-RPN is deployed as-is, without any temporal adaptation, for real-time tracking. In extensive experiments on OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT and TrackingNet, C-RPN consistently achieves state-of-the-art results and runs in real-time.

1. Introduction

Visual tracking is one of the most fundamental problems in computer vision, with a long list of applications such as robotics, human-machine interaction, intelligent vehicles, surveillance and so forth. Despite great advances in recent years, visual tracking remains challenging due to many factors including occlusion, scale variation, etc.

Recently, Siamese network has drawn great attention in the tracking community owing to its balanced accuracy and speed.


Figure 1. Comparisons between the one-stage Siamese-RPN [22] and C-RPN on two challenging sequences: Bolt2 (top row) with similar distractors and CarScale (bottom row) with large scale changes. We observe that C-RPN can distinguish the target from distractors, while Siamese-RPN drifts to the background in Bolt2. In addition, compared to the single regressor in Siamese-RPN, the multi-step regression in C-RPN better localizes the target in the presence of large scale changes in CarScale. Best viewed in color.

By formulating object tracking as a matching problem, Siamese trackers [44, 2, 45, 16, 18, 22, 50, 57] aim to learn offline a generic similarity function from a large set of videos. Among these methods, the work of [22] proposes a one-stage Siamese-RPN for tracking by introducing the region proposal network (RPN), originally used for object detection [37, 28], into the Siamese network. With the proposal extraction by the RPN, this approach simultaneously performs classification and localization over multiple scales, achieving excellent performance. Besides, the use of the RPN avoids the time-consuming pyramid for target scale estimation [2], resulting in a super real-time solution.

1.1. Problem and Motivation

Despite its promising results, Siamese-RPN may drift to the background, especially in the presence of similar semantic distractors (see Fig. 1). We identify two reasons accounting for this.

First, the distribution of training samples is imbalanced: (1) positive samples are far fewer than negative samples, leading to ineffective training of the Siamese network; and (2) most negative samples are easy negatives (non-similar, non-semantic background) that contribute little useful information to learning a discriminative classifier [27]. As a consequence, the classifier is dominated by the easily classified background samples, and degrades when encountering difficult similar semantic distractors.

Second, low-level spatial features are not fully explored. In Siamese-RPN (and other Siamese trackers), only features of the last layer, which contain more semantic information, are used to distinguish the target from the background. In tracking, nevertheless, background distractors and the target may belong to the same category and/or have similar semantic features [48]. In such cases, the high-level semantic features are less discriminative in distinguishing the target from the background.

In addition to the issues above, the one-stage Siamese-RPN applies a single regressor for target localization using pre-defined anchor boxes. These boxes are expected to work well when they have a high overlap with the target. However, in model-free visual tracking, no prior information regarding the target object is known, and it is hard to estimate how the scale of the target changes. Using pre-defined coarse anchor boxes in a single regression step is insufficient for accurate localization [14, 3] (see again Fig. 1).

The class imbalance problem is addressed in two-stage object detectors (e.g., Faster R-CNN [37]). The first proposal stage rapidly filters out most background samples, and then the second classification stage adopts sampling heuristics, such as a fixed foreground-to-background ratio, to maintain a manageable balance between foreground and background. In addition, two steps of regression achieve accurate localization even for objects with extreme shapes.

Motivated by the two-stage detector, we propose a multi-stage tracking framework that cascades a sequence of RPNs to solve the class imbalance problem, and meanwhile fully explores features across layers for robust visual tracking.

1.2. Contribution

As the first contribution, we present a novel multi-stage tracking framework, the Siamese Cascaded RPN (C-RPN), to solve the problem of class imbalance by performing hard negative sampling [47, 39]. C-RPN consists of a sequence of RPNs cascaded from the high-level to the low-level layers of the Siamese network. In each stage (level), an RPN performs classification and localization, and outputs the classification scores and regression offsets for the anchor boxes of that stage. The easy negative anchors are then filtered out, and the rest, treated as hard examples, are used as training samples for the RPN of the next stage. Through this process, C-RPN performs hard negative sampling stage by stage. As a result, the distributions of training samples become sequentially more balanced, and the classifiers of the RPNs become sequentially more discriminative in distinguishing difficult distractors (see Fig. 1).

Another benefit of C-RPN is more accurate target localization compared to the one-stage SiamRPN [22]. Instead of using pre-defined coarse anchor boxes in a single regression step, C-RPN performs multiple steps of regression through its multiple RPNs. In each stage, the anchor boxes (including locations and sizes) are adjusted by the regressor, which provides a better initialization for the regressor of the next stage. As a consequence, C-RPN progressively refines the target bounding box, leading to better localization, as shown in Fig. 1.

Leveraging features from different layers of a neural network has been proven beneficial for improving model discriminability [29, 25, 26]. To fully explore both the high-level semantic and the low-level spatial features for visual tracking, we make the second contribution by designing a novel feature transfer block (FTB). Instead of separately using features from a single layer in each RPN, the FTB enables us to fuse high-level features into a low-level RPN, which further improves its discriminative power in dealing with complex background, resulting in better performance of C-RPN. Fig. 2 illustrates the framework of C-RPN.

Last but not least, the third contribution is to implement a tracker based on the proposed C-RPN. In extensive experiments on six benchmarks, including OTB-2013 [51], OTB-2015 [52], VOT-2016 [19], VOT-2017 [20], LaSOT [10] and TrackingNet [33], our C-RPN consistently achieves state-of-the-art results and runs in real-time.

2. Related Work

Visual tracking has been extensively researched in recent decades. In the following we discuss the most related work, and refer readers to [40, 53, 23] for recent surveys.

Deep tracking. Inspired by the successes in image classification [21, 17], deep convolutional neural networks (CNNs) have been introduced into visual tracking and demonstrated excellent performance [49, 48, 34, 7, 12, 31, 8, 42]. Wang et al. [49] propose a stacked denoising autoencoder to learn generic feature representations for object appearance modeling in tracking. Wang et al. [48] introduce a fully convolutional network tracking (FCNT) approach by transferring pre-trained deep features to improve tracking accuracy. Ma et al. [31] replace hand-crafted features in correlation filter tracking with deep features, achieving remarkable gains. Nam and Han [34] propose a light CNN architecture with online fine-tuning to learn generic features of the tracked target. Fan and Ling [12] extend this approach by introducing a recurrent neural network (RNN) to capture object structure. Song et al. [42] apply adversarial learning in CNNs to learn richer representations for tracking. Danelljan et al. [8] propose continuous convolution filters for correlation filter tracking, and later optimize this method in [7].


Figure 2. Illustration of the architecture of C-RPN, including a Siamese network for feature extraction and cascaded region proposal networks for sequential classifications and regressions. The FTB transfers the high-level semantic features to the low-level RPN, and "A" represents the set of anchor boxes, which are gradually refined stage by stage. Best viewed in color.

Siamese tracking. Siamese network has attracted increasing interest for visual tracking because of its balanced accuracy and speed. Tao et al. [44] utilize a Siamese network to learn a matching function offline from a large set of sequences, and then use the fixed matching function to search for the target in a local region. Bertinetto et al. [2] introduce a fully convolutional Siamese network (SiamFC) for tracking by measuring the region-wise feature similarity between the target object and the candidate. Owing to its light structure and absence of model update, SiamFC runs efficiently at 80 fps. Held et al. [18] propose the GOTURN approach by learning a motion prediction model with a Siamese network. Valmadre et al. [45] use a Siamese network to learn the feature representation for correlation filter tracking. He et al. [16] introduce a two-fold Siamese network for tracking. Wang et al. [50] incorporate an attention mechanism into the Siamese network to learn a more discriminative metric for tracking. Notably, Li et al. [22] combine the Siamese network with an RPN and propose a one-stage Siamese-RPN tracker, achieving excellent performance. Zhu et al. [57] introduce more negative samples to train a distractor-aware Siamese-RPN tracker. Despite the improvement, this approach requires large extra training data from other domains.

Multi-level features. The features from different layers of a neural network contain different information. The high-level features consist of more abstract semantic cues, while the low-level layers contain more detailed spatial information [29]. It has been proven that tracking can benefit from multi-level features. In [31], Ma et al. separately use features from three different layers for three correlation models, and fuse their outputs for the final tracking result. Wang et al. [48] develop two regression models with features from two layers to distinguish similar semantic distractors.

Our approach. In this paper, we focus on solving the problem of class imbalance to improve model discriminability. Our approach is related to but different from the Siamese-RPN tracker [22], which applies a one-stage RPN for classification and localization and ignores the data imbalance problem.

In contrast, our approach cascades a sequence of RPNs to address the data imbalance by performing hard negative sampling, and progressively refines anchor boxes for better target localization using multi-step regression. Our method is also related to [31, 48], which use multi-level features for tracking. However, unlike [31, 48], in which multi-level features are separately used for independent models, we propose a feature transfer block to fuse features across layers for each RPN, improving its discriminative power in distinguishing the target object from complex background.

3. Siamese Cascaded RPN (C-RPN)

In this section, we detail the Siamese Cascaded RPN (referred to as C-RPN) shown in Fig. 2.

C-RPN contains two subnetworks: the Siamese network and the cascaded RPN. The Siamese network is utilized to extract the features of the target template z and the search region x. Afterwards, C-RPN feeds the features of z and x to each RPN. Instead of only using features from one layer, we apply the feature transfer block (FTB) to fuse features from high-level layers for each RPN. An RPN simultaneously performs classification and localization on the feature maps of x. According to the classification scores and regression offsets, we filter out the easy negative anchors (e.g., anchors whose negative confidence is larger than a preset threshold θ), and refine the locations and sizes of the remaining anchors, which are used for training the RPN in the next stage.

3.1. Siamese Network

As in [2], we adopt a modified AlexNet [21] to develop our Siamese network. The Siamese network comprises two identical branches, the z-branch and the x-branch, which extract features from the target template z and the search region x, respectively (see Fig. 2). The two branches share parameters to ensure that the same transformation is applied to both z and x, which is crucial for the similarity metric learning.


Figure 3. Architecture of RPN. Best viewed in color.

More details about the Siamese network can be found in [2].

Different from [22], which only uses the features from the last layer of the Siamese network for tracking, we leverage features from multiple levels to improve model robustness. For convenience, we denote ϕi(z) and ϕi(x) as the feature transformations of z and x from the conv-i layer in the Siamese network with N layers¹.

3.2. One-Stage RPN in Siamese Network

Before describing C-RPN, we first review the one-stage Siamese RPN tracker [22], which consists of two branches for anchor classification and regression, as depicted in Fig. 3. It takes as inputs the feature transformations ϕ1(z) and ϕ1(x) of z and x from the last layer of the Siamese network, and outputs classification scores and regression offsets for the anchors. For simplicity, we omit the subscripts of the feature transformations in what follows.

To perform classification and regression for each anchor, two convolution layers are utilized to adjust the channels of ϕ(z) into suitable forms, denoted as [ϕ(z)]cls and [ϕ(z)]reg, for classification and regression, respectively. Likewise, we apply two convolution layers to ϕ(x) but keep the channels unchanged, obtaining [ϕ(x)]cls and [ϕ(x)]reg. The classification scores {ci} and the regression offsets {ri} for the anchors are then computed as

$$\{c_i\} = \mathrm{corr}\big([\phi(\mathbf{z})]_{cls},\ [\phi(\mathbf{x})]_{cls}\big), \qquad \{r_i\} = \mathrm{corr}\big([\phi(\mathbf{z})]_{reg},\ [\phi(\mathbf{x})]_{reg}\big) \quad (1)$$

where i is the anchor index, and corr(a, b) denotes the correlation between a and b, in which a serves as the kernel. Each ci is a 2d vector representing the negative and positive confidences of the i-th anchor. Similarly, each ri is a 4d vector representing the offsets of the anchor's center location and size relative to the groundtruth.

¹For notational simplicity, we name the layers of the Siamese network in inverse order, i.e., conv-N, conv-(N − 1), · · · , conv-2, conv-1 from the low-level to the high-level layers.
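The corr(·, ·) operation in Eq. (1) can be realized by treating the template features as convolution kernels slid over the search-region features. Below is a minimal PyTorch sketch of this idea (not the authors' MatConvNet implementation; all shapes and names are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def rpn_corr(kernel_feat, search_feat, out_ch):
        """corr(a, b) of Eq. (1): the template feature map `kernel_feat`
        acts as the kernel and is slid over `search_feat`.
        kernel_feat: 1 x (out_ch*C) x Hz x Wz, reshaped into out_ch
        kernels of C channels each; search_feat: 1 x C x Hx x Wx.
        Returns a 1 x out_ch x h x w score map."""
        _, ck, hz, wz = kernel_feat.shape
        kernels = kernel_feat.view(out_ch, ck // out_ch, hz, wz)
        return F.conv2d(search_feat, kernels)

    k = 5                                        # anchors per location
    z_cls = torch.randn(1, 2 * k * 256, 6, 6)    # [phi(z)]_cls (toy shape)
    x_cls = torch.randn(1, 256, 22, 22)          # [phi(x)]_cls (toy shape)
    cls_scores = rpn_corr(z_cls, x_cls, 2 * k)   # {c_i}: 1 x 2k x 17 x 17

With 2k output channels for classification and 4k for regression, one such correlation per branch yields the score and offset maps described above.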


Figure 4. Localization using a single regressor and multiple regressors. The multiple regressors in C-RPN can better handle large scale changes for more accurate localization. Best viewed in color.

The Siamese RPN is trained with a multi-task loss consisting of two parts, i.e., the classification loss (softmax loss) and the regression loss (smooth L1 loss). We refer readers to [22, 37] for further details.

3.3. Cascaded RPN

As mentioned earlier, previous Siamese trackers mostly ignore the problem of class imbalance, resulting in degenerated performance in the presence of similar semantic distractors. Besides, they only use the high-level semantic features from the last layer, and thus do not fully explore multi-level features. To address these issues, we propose a multi-stage tracking framework by cascading a set of L (L ≤ N) RPNs.

RPN_l in the l-th stage (1 < l ≤ L) receives the fused features Φl(z) and Φl(x) of the conv-l layer and the higher-level layers from the FTB, instead of features ϕl(z) and ϕl(x) from a single separate layer [22, 2]. Φl(z) and Φl(x) are obtained as follows,

$$\Phi_l(\mathbf{z}) = \mathrm{FTB}\big(\Phi_{l-1}(\mathbf{z}),\ \phi_l(\mathbf{z})\big), \qquad \Phi_l(\mathbf{x}) = \mathrm{FTB}\big(\Phi_{l-1}(\mathbf{x}),\ \phi_l(\mathbf{x})\big) \quad (2)$$

where FTB(·, ·) denotes the feature transfer block described in Section 3.4. For RPN_1, Φ1(z) = ϕ1(z) and Φ1(x) = ϕ1(x). The classification scores {c_i^l} and the regression offsets {r_i^l} for the anchors in stage l are then calculated as

$$\{c_i^l\} = \mathrm{corr}\big([\Phi_l(\mathbf{z})]_{cls},\ [\Phi_l(\mathbf{x})]_{cls}\big), \qquad \{r_i^l\} = \mathrm{corr}\big([\Phi_l(\mathbf{z})]_{reg},\ [\Phi_l(\mathbf{x})]_{reg}\big) \quad (3)$$

where [Φl(z)]cls, [Φl(x)]cls, [Φl(z)]reg and [Φl(x)]reg are derived by performing convolutions on Φl(z) and Φl(x).

Let A_l denote the anchor set in stage l. With the classification scores {c_i^l}, we filter out the anchors in A_l whose negative confidences are larger than a preset threshold θ, and the rest form a new anchor set A_{l+1}, which is employed for training RPN_{l+1}. For RPN_1, A_1 is pre-defined. Besides, in order to provide a better initialization for the regressor of RPN_{l+1}, we refine the center locations and sizes of the anchors in A_{l+1} using the regression results {r_i^l} of RPN_l, thus generating more accurate localization compared to the single-step regression in Siamese RPN [22], as illustrated in Fig. 4. Fig. 2 shows the cascade architecture of C-RPN.
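The stage-wise rejection of easy negatives can be sketched as follows (a NumPy illustration under assumed conventions: anchors are (cx, cy, w, h) boxes and scores are softmax-normalized (negative, positive) pairs; names are ours, not the paper's):

    import numpy as np

    def filter_easy_negatives(anchors, cls_scores, theta=0.95):
        """Drop anchors whose negative confidence exceeds theta; the
        survivors form the anchor set A_{l+1} for the next stage.
        anchors: N x 4 boxes; cls_scores: N x 2 (negative, positive)."""
        keep = cls_scores[:, 0] <= theta
        return anchors[keep], np.flatnonzero(keep)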

Figure 5. Response maps in different stages. Image (a) is the region of interest, and (b)-(d) show the response maps obtained by the RPNs in stages 1, 2 and 3. We can see that the RPNs are sequentially more discriminative in distinguishing distractors. Best viewed in color.

The loss function ℓ_RPN_l for RPN_l is composed of a classification loss L_cls (softmax loss) and a regression loss L_loc (smooth L1 loss) as follows,

$$\ell_{\mathrm{RPN}_l}\big(\{c_i^l\}, \{r_i^l\}\big) = \sum_i L_{cls}\big(c_i^l, c_i^{l*}\big) + \lambda \sum_i c_i^{l*} L_{loc}\big(r_i^l, r_i^{l*}\big) \quad (4)$$

where i is the anchor index in A_l of stage l, λ is a weight balancing the two losses, c_i^{l*} is the label of anchor i, and r_i^{l*} is the true offset between anchor i and the groundtruth. Following [37], r_i^{l*} = (r_{i(x)}^{l*}, r_{i(y)}^{l*}, r_{i(w)}^{l*}, r_{i(h)}^{l*}) is a 4d vector, such that

$$r_{i(x)}^{l*} = (x^* - x_a^l)/w_a^l, \quad r_{i(y)}^{l*} = (y^* - y_a^l)/h_a^l, \quad r_{i(w)}^{l*} = \log(w^*/w_a^l), \quad r_{i(h)}^{l*} = \log(h^*/h_a^l) \quad (5)$$

where x, y, w and h denote the center coordinates of a box and its width and height. Variables x^* and x_a^l are for the groundtruth and the anchor of stage l (likewise for y, w and h). It is worth noting that, different from [22], which uses fixed anchors, the anchors in C-RPN are progressively adjusted by the regressor of the previous stage, and computed as

$$x_a^l = x_a^{l-1} + w_a^{l-1} r_{i(x)}^{l-1}, \quad y_a^l = y_a^{l-1} + h_a^{l-1} r_{i(y)}^{l-1}, \quad w_a^l = w_a^{l-1} \exp\big(r_{i(w)}^{l-1}\big), \quad h_a^l = h_a^{l-1} \exp\big(r_{i(h)}^{l-1}\big) \quad (6)$$

For the anchors in the first stage, x_a^1, y_a^1, w_a^1 and h_a^1 are pre-defined.
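As a concrete reading of Eqs. (5) and (6), the sketch below encodes regression targets and refines anchors in NumPy (illustrative only; the (cx, cy, w, h) array layout and function names are our assumptions):

    import numpy as np

    def refine_anchors(anchors, offsets):
        """Eq. (6): adjust the stage-(l-1) anchors with the stage-(l-1)
        regression offsets to initialize the stage-l anchors.
        anchors: N x 4 (cx, cy, w, h); offsets: N x 4 (rx, ry, rw, rh)."""
        cx = anchors[:, 0] + anchors[:, 2] * offsets[:, 0]
        cy = anchors[:, 1] + anchors[:, 3] * offsets[:, 1]
        w = anchors[:, 2] * np.exp(offsets[:, 2])
        h = anchors[:, 3] * np.exp(offsets[:, 3])
        return np.stack([cx, cy, w, h], axis=1)

    def encode_targets(anchors, gt):
        """Eq. (5): regression targets of each anchor w.r.t. the
        groundtruth box gt = (cx*, cy*, w*, h*)."""
        rx = (gt[0] - anchors[:, 0]) / anchors[:, 2]
        ry = (gt[1] - anchors[:, 1]) / anchors[:, 3]
        rw = np.log(gt[2] / anchors[:, 2])
        rh = np.log(gt[3] / anchors[:, 3])
        return np.stack([rx, ry, rw, rh], axis=1)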

The above procedure forms the proposed cascaded RPN. Due to the rejection of easy negative anchors, the distribution of training samples for each RPN becomes gradually more balanced. As a result, the classifier of each RPN is sequentially more discriminative in distinguishing difficult distractors. Besides, multi-level feature fusion further improves the discriminability in handling complex background. Fig. 5 illustrates the discriminative power of the different RPNs by showing the detection response map of each stage.

The loss function ℓ_CRPN of C-RPN consists of the loss functions of all RPN_l. For each RPN, the loss is computed using Eq. (4), and ℓ_CRPN is expressed as

$$\ell_{\mathrm{CRPN}} = \sum_{l=1}^{L} \ell_{\mathrm{RPN}_l} \quad (7)$$
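A PyTorch sketch of Eqs. (4) and (7) follows (illustrative, not the authors' implementation; we assume positive anchors are labeled 1, so the c_i^{l*} weight simply masks the regression loss to positives):

    import torch
    import torch.nn.functional as F

    def rpn_loss(cls_logits, labels, reg_pred, reg_target, lam=1.0):
        """Eq. (4): softmax classification loss plus smooth-L1 regression
        loss; the label weight restricts L_loc to positive anchors."""
        l_cls = F.cross_entropy(cls_logits, labels)
        pos = labels == 1
        if pos.any():
            l_loc = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
        else:
            l_loc = reg_pred.sum() * 0.0  # no positives: keep graph intact
        return l_cls + lam * l_loc

    def crpn_loss(stage_outputs):
        """Eq. (7): the C-RPN loss is the sum of the per-stage RPN losses,
        where each stage supplies (cls_logits, labels, reg_pred, reg_target)."""
        return sum(rpn_loss(*s) for s in stage_outputs)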

3.4. Feature Transfer Block

To effectively leverage multi-level features, we introduce the FTB to fuse features across layers so that each RPN is able to share high-level semantic features, which improves its discriminability.


Figure 6. Overview of feature transfer block. Best viewed in color.

In detail, a deconvolution layer is used to match the feature dimensions of the different sources. The features are then fused using element-wise summation, followed by a ReLU layer. To ensure the same groundtruth for anchors in each RPN, we apply interpolation to rescale the fused features such that the output classification maps and regression maps have the same resolution for all RPNs. Fig. 6 shows the feature transfer for RPN_l (l > 1).
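A minimal PyTorch sketch of the FTB under these steps is given below (the channel numbers, kernel size, and the bilinear resize used to align spatial sizes before summation are our assumptions, not the paper's exact settings):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FTB(nn.Module):
        """Feature transfer block sketch: deconvolve the high-level
        features to the low-level channel width, align spatial sizes,
        fuse by element-wise summation followed by ReLU, then interpolate
        the fused map to a common resolution shared by all RPNs."""
        def __init__(self, high_ch, low_ch, k=4, stride=2):
            super().__init__()
            self.deconv = nn.ConvTranspose2d(high_ch, low_ch, k, stride=stride)

        def forward(self, phi_high, phi_low, out_size):
            up = self.deconv(phi_high)
            up = F.interpolate(up, size=phi_low.shape[-2:], mode='bilinear',
                               align_corners=False)  # align before summation
            fused = F.relu(up + phi_low)              # element-wise sum + ReLU
            return F.interpolate(fused, size=out_size, mode='bilinear',
                                 align_corners=False)

    # Toy usage: fuse 512-ch high-level features into a 256-ch low-level map.
    ftb = FTB(high_ch=512, low_ch=256)
    phi_hi = torch.randn(1, 512, 6, 6)
    phi_lo = torch.randn(1, 256, 12, 12)
    out = ftb(phi_hi, phi_lo, out_size=(17, 17))      # 1 x 256 x 17 x 17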

3.5. Training and Tracking

Training. The training of C-RPN is performed on image pairs that are sampled with a random frame interval from the same sequence, as in [22]. The multi-task loss function in Eq. (7) enables us to train C-RPN in an end-to-end manner. Considering that the scale of the target changes smoothly in two consecutive frames, we employ one scale with different ratios for each anchor. The anchor ratios are set to [0.33, 0.5, 1, 2, 3] as in [22].

For each RPN, we adopt the strategy from object detection [37] to determine positive and negative training samples. We define positive samples as anchors whose intersection over union (IoU) with the groundtruth is larger than a threshold τ_pos, and negative samples as anchors whose IoU with the groundtruth bounding box is less than a threshold τ_neg. We generate at most 64 samples from one image pair.
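This sample assignment can be sketched as below (NumPy; the (cx, cy, w, h) box layout and random subsampling scheme are illustrative assumptions around the stated τ_pos = 0.6, τ_neg = 0.3 and 64-sample budget):

    import numpy as np

    def assign_labels(anchors, gt, tau_pos=0.6, tau_neg=0.3, max_samples=64):
        """Label anchors by IoU with the groundtruth box: IoU > tau_pos
        -> positive (1), IoU < tau_neg -> negative (0), in-between ->
        ignored (-1); at most max_samples labels are kept per image pair."""
        def to_corners(b):
            b = np.atleast_2d(b)
            return np.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                             b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], 1)
        a, g = to_corners(anchors), to_corners(gt)[0]
        ix1, iy1 = np.maximum(a[:, 0], g[0]), np.maximum(a[:, 1], g[1])
        ix2, iy2 = np.minimum(a[:, 2], g[2]), np.minimum(a[:, 3], g[3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
        area_g = (g[2] - g[0]) * (g[3] - g[1])
        iou = inter / (area_a + area_g - inter)
        labels = np.full(len(anchors), -1)
        labels[iou > tau_pos] = 1
        labels[iou < tau_neg] = 0
        sampled = np.flatnonzero(labels >= 0)
        if len(sampled) > max_samples:  # keep at most 64 samples per pair
            drop = np.random.choice(sampled, len(sampled) - max_samples,
                                    replace=False)
            labels[drop] = -1
        return labels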

Tracking. We formulate tracking as multi-stage detection. For each video, we pre-compute the feature embeddings of the target template in the first frame. In a new frame, we extract a region of interest according to the result in the last frame, and then perform detection using C-RPN on this region. In each stage, an RPN outputs the classification scores and regression offsets for the anchors. The anchors with negative scores larger than θ are discarded, and the rest are refined and taken over by the RPN in the next stage. After the last stage L, the remaining anchors are regarded as target proposals, from which we determine the best one as the final tracking result using the strategies in [22]. Alg. 1 summarizes the tracking process of C-RPN.

Page 6: Siamese Cascaded Region Proposal Networks for Real-Time ...Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking Heng Fan Haibin Ling Department of Computer and Information

Algorithm 1: Tracking with C-RPN
Input: frame sequence {X_t}_{t=1}^T and groundtruth bounding box b_1 of X_1; trained C-RPN model
Output: tracking results {b_t}_{t=2}^T
1:  Extract the target template z in X_1 using b_1
2:  Extract features {ϕ_l(z)}_{l=1}^L for z from C-RPN
3:  Initialize anchors A_1
4:  for t = 2 to T do
5:      Extract the search region x in X_t using b_{t-1}
6:      Extract features {ϕ_l(x)}_{l=1}^L for x from C-RPN
7:      for l = 1 to L do
8:          if l equals 1 then
9:              Φ_l(z) = ϕ_l(z), Φ_l(x) = ϕ_l(x)
10:         else
11:             Φ_l(z), Φ_l(x) ← Eq. (2)
12:         end if
13:         {c_i^l}, {r_i^l} ← Eq. (3)
14:         Remove any anchor i from A_l whose negative confidence c_{i(neg)}^l > θ
15:         A_{l+1} ← refine the remaining anchors in A_l with {r_i^l} using Eq. (6)
16:     end for
17:     Target proposals ← A_{L+1}
18:     Select the best proposal as the tracking result b_t using the strategies in [22]
19: end for

4. Experiments

Implementation details. C-RPN is implemented in Matlab using MatConvNet [46] on a single Nvidia GTX 1080 GPU with 8GB memory. The backbone Siamese network adopts a modified AlexNet [21] with group convolutions removed. Instead of training from scratch, we borrow the parameters from the model pretrained on ImageNet [9]. During training, the parameters of the first two layers are frozen. The number of stages L is set to 3. The thresholds θ, τ_pos and τ_neg are empirically set to 0.95, 0.6 and 0.3, respectively. C-RPN is trained end-to-end over 50 epochs using SGD, and the learning rate is annealed geometrically at each epoch from 10^-2 to 10^-6. We train C-RPN using the training data from [10] for the experiment under Protocol II on LaSOT [10], and using VID [38] and YT-BB [36] for all other experiments.

Note that the comparison with Siamese-RPN [22] is fair since the same training data is used.

4.1. Experiments on OTB-2013 and OTB-2015

We conduct experiments on the popular OTB-2013 [51] and OTB-2015 [52], which consist of 51 and 100 fully annotated videos, respectively. C-RPN runs at around 36 fps.

Following [51], we adopt the success plot in one-pass evaluation (OPE) to assess different trackers.


Figure 7. Comparisons with state-of-the-art tracking approaches on OTB-2013 [51] and OTB-2015 [52]. C-RPN achieves the best results on both benchmarks. Best viewed in color.

The comparison with 14 state-of-the-art trackers (SiamRPN [22], DaSiamRPN [57], TRACA [6], ACT [4], BACF [13], ECO-HC [7], CREST [41], SiamFC [2], Staple [1], PTAV [11], SINT [44], CFNet [45], HDT [35] and HCFT [31]) is shown in Fig. 7. C-RPN achieves the best performance on both benchmarks. Specifically, we obtain success scores of 0.675 and 0.663 on OTB-2013 and OTB-2015, respectively. Compared with the baseline one-stage SiamRPN, with scores of 0.658 and 0.637, we obtain improvements of 1.9% and 2.6%, showing the advantage of the multi-stage RPN in accurate localization. DaSiamRPN uses extra negative training data from other domains to improve its ability to handle similar distractors, and obtains scores of 0.655 and 0.658. Without using extra training data, C-RPN outperforms DaSiamRPN by 2.0% and 0.5%. More results and comparisons on OTB-2013 [51] and OTB-2015 [52] are shown in the supplementary material.

4.2. Experiments on VOT-2016 and VOT-2017

VOT-2016 [19] consists of 60 sequences, aiming at assessing the short-term performance of trackers. The overall performance of a tracking algorithm is evaluated using the Expected Average Overlap (EAO), which takes both accuracy and robustness into account. The speed of a tracker is represented by a normalized speed (EFO).

We evaluate C-RPN on VOT-2016 and compare it with 11 trackers, including the baseline SiamRPN [22] and the other top ten approaches in VOT-2016. Fig. 8 shows the EAO of the different trackers. C-RPN achieves the best result, significantly outperforming the baseline SiamRPN and the other approaches. Tab. 1 lists detailed comparisons of the different trackers on VOT-2016, from which we can see that C-RPN outperforms the other trackers in both accuracy and robustness, and runs efficiently.

VOT-2017 [20] contains 60 sequences, developed by replacing the 10 least challenging videos in VOT-2016 [19] with 10 difficult sequences. Different from VOT-2016 [19], VOT-2017 [20] introduces a new real-time experiment that takes both tracking performance and efficiency into account. We compare C-RPN with SiamRPN [22] and the other top ten approaches in VOT-2017 using the EAO of the baseline and real-time experiments, as shown in Tab. 2.


Figure 8. Comparisons on VOT-2016 [19]. A larger (further right) value indicates better performance. Our C-RPN significantly outperforms the baseline and other approaches. Best viewed in color.

Table 1. Detailed comparisons on VOT-2016 [19]. The best two results are highlighted in red and blue fonts, respectively.

Tracker         EAO    Accuracy  Failure  EFO
C-RPN           0.363  0.594     0.95     9.3
SiamRPN [22]    0.344  0.560     1.12     23.0
C-COT [8]       0.331  0.539     0.85     0.5
TCNN [19]       0.325  0.554     0.96     1.1
SSAT [19]       0.321  0.577     1.04     0.5
MLDF [19]       0.311  0.490     0.83     1.2
Staple [1]      0.295  0.544     1.35     11.1
DDC [19]        0.293  0.541     1.23     0.2
EBT [56]        0.291  0.465     0.90     3.0
SRBT [19]       0.290  0.496     1.25     3.7
STAPLEp [19]    0.286  0.557     1.32     44.8
DNT [5]         0.278  0.515     1.18     1.1

Table 2. Comparisons on VOT-2017 [20]. The best two results are highlighted in red and blue fonts, respectively.

Tracker         Baseline EAO  Real-time EAO
C-RPN           0.289         0.273
SiamRPN [22]    0.243         0.244
LSART [43]      0.323         0.055
CFWCR [20]      0.303         0.062
CFCF [15]       0.286         0.059
ECO [7]         0.280         0.078
Gnet [20]       0.274         0.060
MCCT [20]       0.270         0.061
C-COT [8]       0.267         0.058
CSRDCF [30]     0.256         0.100
SiamDCF [20]    0.249         0.135
MCPF [54]       0.248         0.060

From Tab. 2, C-RPN achieves an EAO score of 0.289, significantly outperforming the one-stage SiamRPN [22] with its EAO score of 0.243. In addition, compared with LSART [43] and CFWCR [20], C-RPN shows competitive performance. In the real-time experiment, C-RPN obtains the best result with an EAO score of 0.273, outperforming all other trackers.

4.3. Experiment on LaSOT

LaSOT [10] is a recent large-scale dataset aimed at both training and evaluating trackers. We compare C-RPN to 35 approaches, including ECO [7], MDNet [34], SiamFC [2], VITAL [42], StructSiam [55], TRACA [6], BACF [13] and so forth. We refer readers to [10] for more details about the compared trackers. We do not compare C-RPN to Siamese-RPN [22] because neither its implementation nor its results on LaSOT are available.

Following [10], we report the success (SUC) results of the different trackers, as shown in Fig. 9. Our C-RPN outperforms all other state-of-the-art trackers under both protocols. We achieve SUC scores of 0.459 and 0.455 under Protocols I and II, outperforming the second best tracker, MDNet, with SUC scores of 0.413 and 0.397, by 4.6% and 5.8%, respectively. In addition, C-RPN runs at around 23 fps on LaSOT, which is much more efficient than MDNet at around 1 fps. Compared with the Siamese network-based tracker SiamFC, with SUC scores of 0.358 and 0.336, C-RPN gains improvements of 10.1% and 11.9%. Due to limited space, we refer readers to the supplementary material for more details about the results and comparisons on LaSOT.

4.4. Experiment on TrackingNet

TrackingNet [33] is proposed to assess the performance of trackers in the wild. We evaluate C-RPN on its testing set of 511 videos. Following [33], we use three metrics, precision (PRE), normalized precision (NPRE) and success (SUC), for evaluation. Tab. 3 shows the comparison with the trackers with the top PRE scores², and C-RPN achieves the best results on all three metrics. Specifically, C-RPN obtains a PRE score of 0.619, an NPRE score of 0.746 and a SUC score of 0.669, outperforming the second best tracker, MDNet, with a PRE score of 0.565, an NPRE score of 0.705 and a SUC score of 0.606, by 5.4%, 4.1% and 6.3%, respectively. Besides, C-RPN runs efficiently at a speed of around 32 fps.

4.5. Ablation Experiment

To validate the impact of different components, we conduct ablation experiments on LaSOT (Protocol II) [10] and VOT-2017 [20].

Number of stages? As shown in Tab. 4, adding the second stage significantly improves the one-stage baseline. The SUC on LaSOT is improved by 2.9% from 0.417 to 0.446, and the EAO on VOT-2017 is increased by 3.5% from 0.248 to 0.283. The third stage produces further improvements of 0.9% and 0.6% on LaSOT and VOT-2017, respectively.

²The result of C-RPN on TrackingNet [33] is evaluated by the server provided by the organizers at http://eval.tracking-net.org/web/challenges/challenge-page/39/leaderboard/42. The results of the compared trackers are reported from [33]. The full comparison is shown in the supplementary material.


Figure 9. Comparisons with state-of-the-art tracking methods on LaSOT [10]. C-RPN outperforms existing approaches on success by large margins under both protocols. Best viewed in color.

Table 3. Comparisons on TrackingNet [33] with the best two results highlighted in red and blue fonts, respectively.

Tracker          PRE    NPRE   SUC
C-RPN            0.619  0.746  0.669
MDNet [34]       0.565  0.705  0.606
CFNet [45]       0.533  0.654  0.578
SiamFC [2]       0.533  0.663  0.571
ECO [7]          0.492  0.618  0.554
CSRDCF [30]      0.480  0.622  0.534
SAMF [24]        0.477  0.598  0.504
ECO-HC [7]       0.476  0.608  0.541
Staple [1]       0.470  0.603  0.528
Staple_CA [32]   0.468  0.605  0.529
BACF [13]        0.461  0.580  0.523

Table 4. Effect of the number of stages in C-RPN.

# Stages          One stage  Two stages  Three stages
SUC on LaSOT      0.417      0.446       0.455
Speed on LaSOT    48 fps     37 fps      23 fps
EAO on VOT-2017   0.248      0.278       0.289

Table 5. Effect of negative anchor filtering (NAF) in C-RPN.

                  C-RPN w/o NAF  C-RPN w/ NAF
SUC on LaSOT      0.439          0.455
EAO on VOT-2017   0.282          0.289

Table 6. Effect of the feature transfer block (FTB) in C-RPN.

                  C-RPN w/o FTB  C-RPN w/ FTB
SUC on LaSOT      0.442          0.455
EAO on VOT-2017   0.278          0.289

We observe that the improvement brought by the second stage is larger than that brought by the third stage, suggesting that most of the difficult background is handled in the second stage. Adding more stages may lead to further improvements, but also increases the computation (the speed drops from 48 to 23 fps).

Negative anchor filtering? Filtering out the easy negatives aims to provide more balanced training samples for the RPN in the next stage. To show its effectiveness, we set the threshold θ to 1 so that all refined anchors are sent to the next stage. Tab. 5 shows that filtering negative anchors in C-RPN improves the SUC on LaSOT by 1.6% from 0.439 to 0.455, and the EAO on VOT-2017 by 0.7% from 0.282 to 0.289, which evidences that balanced training samples are crucial for training a more discriminative RPN.

Feature transfer block? As demonstrated in Tab. 6, the FTB improves the SUC on LaSOT by 1.3% from 0.442 to 0.455 without losing much efficiency, and the EAO on VOT-2017 by 1.1% from 0.278 to 0.289, validating the effectiveness of multi-level feature fusion in improving performance.

These studies show that each ingredient brings an individual improvement, and that all of them work together to produce the excellent tracking performance.

5. Conclusion

In this paper, we propose a novel multi-stage framework, C-RPN, for tracking. Compared with previous art, C-RPN demonstrates more robust performance in handling complex background, such as similar distractors, by performing hard negative sampling within a cascade architecture. In addition, the proposed FTB enables effective feature leverage across layers for a more discriminative representation. Moreover, C-RPN progressively refines the target bounding box using multiple steps of regression, leading to more accurate localization. In extensive experiments on six popular benchmarks, C-RPN consistently achieves state-of-the-art results and runs in real-time.


References

[1] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
[2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCVW, 2016.
[3] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[4] B. Chen, D. Wang, P. Li, S. Wang, and H. Lu. Real-time actor-critic tracking. In ECCV, 2018.
[5] Z. Chi, H. Li, H. Lu, and M.-H. Yang. Dual deep network for visual tracking. TIP, 26(4):2005–2015, 2017.
[6] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Y. Choi. Context-aware deep feature compression for high-speed visual tracking. In CVPR, 2018.
[7] M. Danelljan, G. Bhat, F. S. Khan, M. Felsberg, et al. ECO: Efficient convolution operators for tracking. In CVPR, 2017.
[8] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. arXiv, 2018.
[11] H. Fan and H. Ling. Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV, 2017.
[12] H. Fan and H. Ling. SANet: Structure-aware network for visual tracking. In CVPRW, 2017.
[13] H. K. Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. In ICCV, 2017.
[14] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, 2015.
[15] E. Gundogdu and A. A. Alatan. Good features to correlate for visual tracking. TIP, 27(5):2526–2540, 2018.
[16] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In CVPR, 2018.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
[19] M. Kristan et al. The visual object tracking VOT2016 challenge results. In ECCVW, 2016.
[20] M. Kristan et al. The visual object tracking VOT2017 challenge results. In ICCVW, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018.
[23] P. Li, D. Wang, L. Wang, and H. Lu. Deep visual tracking: Review and experimental comparison. PR, 76:323–338, 2018.
[24] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCV, 2014.
[25] G. Lin, A. Milan, C. Shen, and I. D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
[26] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[27] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, 2017.
[28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[30] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In CVPR, 2017.
[31] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
[32] M. Mueller, N. Smith, and B. Ghanem. Context-aware correlation filter tracking. In CVPR, 2017.
[33] M. Muller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018.
[34] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
[35] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In CVPR, 2016.
[36] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[39] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[40] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. TPAMI, 36(7):1442–1468, 2014.
[41] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. Lau, and M.-H. Yang. CREST: Convolutional residual learning for visual tracking. In ICCV, 2017.
[42] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang. VITAL: Visual tracking via adversarial learning. In CVPR, 2018.
[43] C. Sun, H. Lu, and M.-H. Yang. Learning spatial-aware regressions for visual tracking. In CVPR, 2018.
[44] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
[45] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
[46] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In ACM MM, 2015.
[47] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[48] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
[49] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In NIPS, 2013.
[50] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In CVPR, 2018.
[51] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[52] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. TPAMI, 37(9):1834–1848, 2015.
[53] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM CSUR, 38(4):13, 2006.
[54] T. Zhang, C. Xu, and M.-H. Yang. Multi-task correlation particle filter for robust object tracking. In CVPR, 2017.
[55] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu. Structured siamese network for real-time visual tracking. In ECCV, 2018.
[56] G. Zhu, F. Porikli, and H. Li. Beyond local search: Tracking objects everywhere with instance-specific proposals. In CVPR, 2016.
[57] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018.