
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2017

Visual Tracking with Deformable Continuous Convolution Operators

Joakim Johnander
LiTH-ISY-EX--17/5047--SE

Supervisor: Martin Danelljan, ISY, Linköping University
Examiner: Fahad Khan, ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2017 Joakim Johnander

Abstract

Visual Object Tracking is the computer vision problem of estimating a target trajectory in a video given only its initial state. A visual tracker often acts as a component in the intelligent vision systems seen in, for instance, surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications may require robust tracking performance on difficult sequences depicting targets undergoing large changes in appearance, while enforcing a real-time constraint.

Discriminative correlation filters have shown promising tracking performance in recent years, and have consistently improved the state of the art. With the advent of deep learning, new robust deep features have improved tracking performance considerably. However, methods based on discriminative correlation filters learn a rigid template describing the target appearance. This implies an assumption of target rigidity which is not fulfilled in practice. This thesis introduces an approach which integrates deformability into a state-of-the-art tracker. The approach is thoroughly tested on three challenging visual tracking benchmarks, achieving state-of-the-art performance.


Abstract

Visual tracking is a problem in computer vision where an object is to be followed in a video given only its initial position. A visual tracker often acts as a component in the intelligent vision systems found, for instance, in surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications often require robustness under difficult conditions where objects can undergo large changes in appearance. At the same time, there is often a real-time requirement.

Discriminative correlation filters have shown promising results in recent years and have improved the state of the art in visual tracking year after year. With the advent of deep learning, new features could be extracted and used, thereby improving performance considerably. However, methods based on discriminative correlation filters learn a rigid template to describe the object appearance. This partly assumes that objects are rigid, which is not true in reality. This thesis introduces a method which integrates deformability into the learning framework of an existing state-of-the-art visual tracker. The proposed method is tested on three challenging datasets, where its performance is measured to be good.


Acknowledgments

This thesis has been written at the Computer Vision Laboratory at ISY, to which I am grateful for having me. I want to extend my deepest gratitude to Dr. Fahad Shahbaz Khan for providing me with an exciting thesis proposal, and for very kindly providing insights. I am equally grateful to Martin Danelljan, who would provide ideas and assistance whenever needed, discussed the topic and related topics with me, and patiently answered my inquiries. I also want to thank Goutam Bhat for helping me set up the tracking framework and discussing various topics with me.

Linköping, May 2017
Joakim Johnander


Contents

Notation

1 Introduction
    1.1 Brief Background
    1.2 Problem Formulation
    1.3 Motivation
        1.3.1 Practical Applications
        1.3.2 Recent Research
    1.4 Thesis Outline

2 Background
    2.1 Visual Tracking Problem
        2.1.1 Datasets
        2.1.2 Attributes
    2.2 Features
        2.2.1 Color Names
        2.2.2 CNN Features
    2.3 Prior Work in Visual Tracking
        2.3.1 Current State-of-the-Art
        2.3.2 Relation to Video Segmentation
        2.3.3 Deep Learning for Visual Tracking
        2.3.4 Part-based Approaches
    2.4 Discriminative Correlation Filters
        2.4.1 Simple Correlation Filter
        2.4.2 Closed Form Solution
        2.4.3 Equivalence of Convolution- and Correlation Operators
        2.4.4 Multiple Features
        2.4.5 Detection
        2.4.6 Scale Estimation
        2.4.7 Windowing of Features
        2.4.8 Spatial Regularization
    2.5 Continuous Formulation
        2.5.1 Definitions
        2.5.2 Filter Application
        2.5.3 Objective Functional
        2.5.4 Filter Training
        2.5.5 The Label Function
        2.5.6 The Interpolation Function
        2.5.7 Projection Estimation for Features

3 Method
    3.1 Creating a Deformable Filter
    3.2 Objective Functional
    3.3 Optimization Scheme
        3.3.1 Fourier Domain Formulation
        3.3.2 Filter Training
        3.3.3 Subfilter Displacement Estimation
    3.4 Subfilter Initialization
        3.4.1 Practical Details
        3.4.2 Subfilter Selection

4 Experiments and Results
    4.1 Evaluation Method
        4.1.1 Performance Metrics
        4.1.2 Running the Trackers
        4.1.3 Datasets
    4.2 Baseline Experiments
        4.2.1 Features
        4.2.2 Subfilter Position Regularization
    4.3 Qualitative Assessment
    4.4 Quantitative Evaluation
        4.4.1 OTB-2015
        4.4.2 TempleColor
        4.4.3 VOT2016
        4.4.4 Attribute-based Comparison

5 Discussion
    5.1 Results
    5.2 Weaknesses
        5.2.1 Ghost Peaks
        5.2.2 Stuck in Local Minima
        5.2.3 Spatial Extent of the Spatial Regularization
        5.2.4 Increased Number of Parameters
        5.2.5 Instability
    5.3 Future Work
        5.3.1 Subfilter Displacement Estimation as Detection
        5.3.2 Optimization Process Analysis
        5.3.3 Reducing the Subfilter Size
        5.3.4 Selecting Parts
        5.3.5 Dynamic Part Introduction and Removal
        5.3.6 Increasing Deformability

6 Conclusion

A Derivations and Mathematical Details
    A.1 Filter Coefficient Optimization From Sec. 2.4.2
    A.2 Normal Equations
    A.3 Joint Normal Equations
    A.4 Fourier Transform of a Gaussian
    A.5 Fourier Coefficients of a Function Repetition
    A.6 Fourier Transform of the Interpolation Function

Bibliography

Notation

SETS

Notation        Explanation
Z               The set of integers
R               The set of real numbers
C               The set of complex numbers
×               Cartesian product
K^N             The set K × K × · · · × K of any field K, equipped with the standard inner product
K^+             The set {x ∈ K : x > 0}
L²(R)           The set of functions which are Lebesgue integrable on R
L²(R^N)         The set of functions of N variables which are Lebesgue integrable on R^N
L²_T(R)         The set of periodic functions which are Lebesgue integrable on [0, T), where T ∈ R⁺, equipped with the standard inner product
L²_T(R^N)       The set of periodic functions of N variables which are Lebesgue integrable on [0, T)^N
ℓ²              The space of square-summable sequences

LINEAR ALGEBRA

Notation        Explanation
u               Finite complex-valued vectors, u ∈ C^N, are denoted with boldface
U               Finite complex-valued matrices, U ∈ C^(N×M), are denoted with capital letters
Uu              Matrix-vector product
UV              Matrix-matrix product
u^T             Transpose
u^H             Conjugate transpose

FUNCTIONS AND OPERATORS

Notation        Explanation
F{u}            The Fourier transform for u ∈ L²(R), or the operator granting the Fourier coefficients of u for u ∈ L²_T(R)
F⁻¹{û}          The inverse of F
DFT             The discrete Fourier transform
DFT⁻¹           The inverse discrete Fourier transform
û               The Fourier coefficients of u ∈ L²_T(R^N), the Fourier transform of u ∈ L², or the DFT of a discrete function
u · v           Element-wise multiplication
u ∗ v           Convolution
u ⋆ v           Cross-correlation
⟨u, v⟩          The standard inner product between vectors u and v
‖u‖             The standard norm induced by the inner product
‖u‖_F           The Frobenius norm, defined for any finite matrix u
ǔ               The mirroring of a function, ǔ(t) = u(−t) for t ∈ R
ū               The complex conjugate of a function or vector u

ABBREVIATIONS

Notation        Explanation
AUC             Area Under the Curve
CNN             Convolutional Neural Network
DCF             Discriminative Correlation Filter

1 Introduction

This thesis studies the computer vision problem of visual tracking, where the aim is to estimate the trajectory of a target in a video. The problem has numerous applications and has received much attention. This thesis describes the problem and a state-of-the-art method to tackle it. The proposed method is based on discriminative correlation filters.

    1.1 Brief Background

Computer vision is an interdisciplinary field of research treating problems where information is extracted from some type of visual input. Visual input includes, but is not limited to, RGB images, RGB video, thermal images, three-dimensional laser scans, event-based cameras, computed tomography images and grayscale images. The output is often high-level, semantic information, such as the presence of an object category in an image, automatic image captioning, tumor development assessment, driving lane detection, or target trajectory estimation. Generic visual object tracking is a computer vision problem where one is given a video and the initial position of an object as a bounding box. The goal is to estimate the object trajectory as it moves throughout the video. In practice, tracking is often part of a larger computer vision system where the initial bounding boxes are found by a detector picked for the current application. There are several subproblems, including occlusions, illumination changes, scale changes, in- and out-of-plane rotations, non-rigid deformations, difficult backgrounds and scene contamination. A tracker has to utilize the information given in the first frame to build a model which discriminates between the target and the background. Usually the model is updated in subsequent frames under some kind of assumption that the target location is predicted correctly in the new frame. Figure 1.1 shows the result of the tracker which is later proposed in this thesis on two example videos.

The arguably simplest solution is to save the pixel values of the patch in the target rectangle, and then find the target in the next frame by using some kind of similarity measure, for instance the mean square error, to match the extracted patch with all other possible patches. In fact, a basic signal processing technique to locate a given signal part in an unknown signal is to use correlation. As the image is a two-dimensional signal, a very simple tracker can be constructed as a correlation filter. In the first frame it is set to the patch to be matched, and it is applied to subsequent frames where the largest response is used as an estimate of the target location. These methods are very simple, and would fail to track the target through appearance changes induced by, for instance, rotations and deformations.

The matched correlation filter is a simple example of a model for the target appearance, often referred to as an appearance model. For most practical applications more sophisticated appearance models are required. One fairly simple example is to use a histogram of the target colors or some other information. Another example is to model the target appearance with probability distributions such as a Gaussian Mixture Model. In general, the tracking problem is often modelled using probabilistic methods [39][31][29] or as an optimization problem [5][22][13].

Visual tracking can be tackled with a simple model using good features. Such trackers take feature maps as input, rather than the raw grayscale or RGB images. In computer vision a feature is a vector calculated at a point in an image via the application of some mapping. Often the features are calculated using a local region around the point. A feature map consists of these features sampled on a grid and is therefore an image with a number of channels. An example of a simple feature is the image gradient at a certain point. Another commonly used example is the color names (CN) [45] descriptor, a vector of size 11 where each element is the probability that the pixel is of a certain color. The most well-known feature in computer vision is arguably SIFT [32], but it is rarely used in tracking. A feature reminiscent of SIFT is the Histogram of Oriented Gradients (HOG) [9], which has seen widespread use in tracking algorithms as it is fast to compute and of small size.

In this thesis a discriminative correlation filter (DCF) based tracker is used as the baseline. This is motivated by the fact that DCF-based methods have achieved state-of-the-art performance across several tracking datasets [27][14][12][21][34][4]. A DCF is reminiscent of the previously described matched correlation filters. Target detection is performed in each frame via a two-dimensional correlation filter applied to the input image. The filter response is treated as a detection score, where the location of the maximum value is used as an estimate of the target position. In contrast to the matched filter, a DCF is trained on several previous frames before estimating the target position in the current frame. Training the filter refers to using previous images and target positions to find a filter, usually by minimization or maximization of some quantity. An example of this quantity is

$$\sum_{c=1}^{C} \left\| f \star x^c - y^c \right\|_2^2 \tag{1.1}$$

where f is the filter, ⋆ denotes cross-correlation, ‖·‖₂ is the Euclidean norm, x^c is the feature map extracted from image frame c, and y^c is a Gaussian with a sharp peak located at the estimated or given target position. The quantity is minimized over the filter coefficients f. By using Parseval's formula, the problem can be solved very efficiently in the Fourier domain.
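To make the Fourier-domain shortcut concrete, the following minimal numpy sketch (with hypothetical random arrays standing in for a real feature map, label, and filter) verifies that the spatial-domain residual of (1.1) for a single frame equals its Fourier-domain counterpart up to the DFT normalization factor, which is exactly what Parseval's formula guarantees:

```python
import numpy as np

# Hypothetical data: f is a filter, x a single-channel sample, y a label.
rng = np.random.default_rng(0)
N1, N2 = 32, 32
f = rng.standard_normal((N1, N2))
x = rng.standard_normal((N1, N2))
y = rng.standard_normal((N1, N2))

# Circular cross-correlation via the DFT: DFT{f ? x} = conj(DFT{f}) * DFT{x}.
score = np.real(np.fft.ifft2(np.conj(np.fft.fft2(f)) * np.fft.fft2(x)))

# Spatial-domain residual ...
spatial = np.sum((score - y) ** 2)
# ... equals the Fourier-domain residual divided by N1*N2 (Parseval),
# so the objective can be optimized independently per frequency.
fourier = np.sum(np.abs(np.conj(np.fft.fft2(f)) * np.fft.fft2(x)
                        - np.fft.fft2(y)) ** 2) / (N1 * N2)
assert np.allclose(spatial, fourier)
```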

Despite reaching state-of-the-art performance, discriminative correlation filters contain an underlying assumption that targets are rigid and do not undergo rotations. This assumption is often violated, for instance when the video depicts a running human, or whenever the video contains changes of perspective relative to the object. These violations pose no problem as long as large parts of the target appearance remain the same, or whenever powerful features yield sufficient invariance to these transformations.

Figure 1.1: Three frames of three sequences from the OTB-2015 dataset. In the initial frame the bounding box is given, and a tracker attempts to estimate bounding boxes in the subsequent frames. These estimates are marked with green boxes.

    1.2 Problem Formulation

This thesis studies filter deformability in the context of visual tracking. Deformability shall be introduced into a state-of-the-art DCF-based tracker. There are various considerations to be made here. While the proposed method must allow deformations in the appearance model, prior knowledge of plausible deformations must most likely be incorporated. Furthermore, the sophisticated online learning process of the baseline must be adapted to the introduced deformability. The proposed method shall be tested extensively to measure tracking performance, using several commonly used datasets developed by the tracking community. Cases where tracking performance increases or decreases should be identified and discussed. Formulated in terms of questions, we ask:

    How can support for target deformability be integrated into the appearance model?


What are the effects on tracking performance?

    1.3 Motivation

To motivate this study, two questions are posed. First, is visual tracking something needed by society or industry; more bluntly, can it yield something commercially viable? Second, is the study of deformable discriminative correlation filters of interest to the research community?

    1.3.1 Practical Applications

Generic visual object tracking is a problem with numerous applications. In recent years, large companies such as Google and Tesla have looked into autonomous driving. Swedish companies such as Volvo and Autoliv have also shown interest in the field. For autonomous driving, generic tracking proves useful by estimating trajectories of detected objects. It is also an area where being able to handle every scenario, including deformations and rotations, is imperative and can be the deciding factor between life and death.

Unmanned Aerial Vehicles (UAV), or drones, have lately received attention from companies and hobbyists alike. These often come equipped with a camera, allowing the UAVs to make intelligent decisions on their own based on visual input. Amongst the deluge of more or less viable applications there are some notable examples. Drones could be used to find and follow survivors of natural disasters. They could also be used to assist the police in finding and following suspects. Another use would be UAVs collecting footage for movies or sports: the UAV could track and follow an actor or actress in a movie scene, or the player running to score in football.

Intelligent surveillance systems can also benefit from visual tracking. Surveillance systems can refer to systems used by companies, individuals and the police to prevent or solve crime, but also to systems used by industry during production. In both cases it may be desirable to count and keep track of objects, which can be done by detecting and tracking them.

    1.3.2 Recent Research

Recently a large number of papers have been published in the area of visual object tracking. The area has generated much attention and is well represented at notable computer vision conferences such as ICCV, ECCV and CVPR [27].

Several state-of-the-art trackers rely on DCF-based approaches, and these have shown consistent performance increases in several visual tracking benchmarks. Despite this, they often rely on an assumption of target rigidity and inability to rotate. By using features containing powerful semantic information they are often, but not always, able to keep track of targets undergoing large changes in appearance. Common targets, such as humans performing various tasks, may prove difficult for these rigid DCF-based approaches. It is therefore desirable to introduce deformability into the method. Staple [4] proposes to use color histograms, which are fairly invariant to deformations but not robust to changes in illumination. Several recent works aim at introducing part-based information into the DCF framework [31][29][33]. This often involves applying several filters and merging the information they provide via some additional components. These works have shown increases in performance. This thesis studies how to introduce deformability by rewriting the filter as a linear combination of subfilters, and merging the information in a single joint objective functional. The merit is that the tracker uses few components, solving a single optimization problem to track.

    1.4 Thesis Outline

This thesis introduces the tracking problem in Sec. 2.1, together with the challenges associated with it. Section 2.3 briefly reviews some current state-of-the-art tracking methods, some methods for the closely related problem of video object segmentation, and lastly some methods attempting to explicitly add deformability to DCF-based trackers. Discriminative correlation filters are introduced in Sec. 2.4, which begins with a simple formulation and continues with several improvements. Sec. 2.5 contains thorough theory of a learning framework utilized by a state-of-the-art tracker. In chapter 3 this framework is altered to introduce deformability to DCF-based filters. Various considerations and method choices are briefly discussed and motivated. Chapter 4 describes the experiments performed and lists the results. The results and future work are discussed in chapter 5. The thesis is concluded with chapter 6.

2 Background

A discriminative correlation filter can be used to build a simple solution to the problem of visual tracking. This chapter describes the visual tracking problem and some important approaches to solving it. In Sec. 2.4 the theory of discriminative correlation filters is introduced. The section begins with a simple formulation, used for instance by Bolme et al. [5]. This is followed by descriptions of several successful improvements, which are needed to attain competitive performance.

With the advent of deep learning, powerful convolutional neural networks trained for image classification tasks could be used as a pre-processing step for visual tracking. This resulted in large improvements to tracking performance, but combining the information provided by the different parts of the classification network proved challenging. The learning framework introduced with C-COT [13] provided a method tackling this problem and attained top performance across several benchmarks. The method is based on discriminative correlation filters and utilizes the improvements described in Sec. 2.4. Section 2.5 provides a thorough description of this learning framework, including two improvements to C-COT introduced in [14].

    2.1 Visual Tracking Problem

Generic visual object tracking is a fundamental computer vision problem which has received much attention from industry and researchers alike. Given a video and the object state (x, y, height, width) in the first frame, a tracker should estimate the object state for all future frames. The problem is generic in the sense that no prior knowledge of the object to be tracked is available. This is in contrast to scenarios where the type of target is known beforehand.


    2.1.1 Datasets

The applications of visual tracking are many and diverse, with each application having its own set of requirements. These can be real-time speed constraints, or performance requirements for a special class of videos. In practice, picking the best tracker is application specific and will imply some kind of trade-off.

However, to be able to test and compare generic visual object trackers properly and fairly, the community has developed datasets and evaluation metrics. Any publication within the field is expected to evaluate its method on these collective datasets. The method proposed in this thesis is evaluated on three such datasets: OTB-2015 [48] consisting of 100 videos, TempleColor [30] consisting of 128 videos, and VOT2016 [27] consisting of 60 videos. The datasets are diverse and challenging, containing a large set of scenes and targets. Some experiments in this thesis are performed using OTB-2015 or TempleColor. Evaluation of the final tracker is however done across all three datasets.

    2.1.2 Attributes

To be able to analyze a tracker it is necessary to understand why a tracker may fail. The datasets are rich with instances where the target to be tracked undergoes large changes in appearance, or where the scene contains a difficult background. This section briefly describes several types of such tracking challenges. In the literature, these are often referred to as attributes.

Background Clutter: In many cases the features of a target may look much like the background. Consider tracking a face when there is a crowd in the background. Small changes in target appearance may result in a part of the background looking more similar to the target than the target itself. If the tracker sees and is trained on a face for several frames, a simple out-of-plane rotation of the target may yield a situation where a face in the crowd looks more like the target than the target itself. Any situation where the background contains features similar to those of the target can result in tracker failure.

Deformations: Non-rigid deformations form a large set of transformations which a target can undergo. One of the most common examples is a target consisting of fairly rigid parts which move and rotate with respect to each other, such as a human. Humans are deformable, but under most circumstances they are piecewise rigid. Another example of piecewise rigidity is a flying bird, where each wing and the center part form three fairly rigid parts. An example where the target contains little to no rigidity is a swimming octopus.

Fast Motion: Several videos depict targets which move very quickly with respect to their size. An oftentimes utilized prior in the visual tracking problem is that a target rarely moves very far between two subsequent frames. This prior information can stop a tracker from switching between two similar targets, which can otherwise happen for instance in a video showing competing 100 m runners. However, utilizing this information in videos showing fast motion, such as several scenes of "The Matrix", may cause a tracker to lose its target.

Illumination Variations: The tracked target may be subject to spatial and temporal illumination variations. The scene may contain different lighting conditions in different places, and the target may be subject to events such as lightning flashes or moving lights. The impact of such effects can be attenuated by utilizing features invariant to changes in illumination, such as HOG [9].

In-plane Rotations: In-plane rotations are rotations occurring in the 2D image plane. An example of an in-plane rotation is a motorcyclist performing a back flip, seen from the side. There is prior information available for targets undergoing such transformations, namely that the appearance remains the same, just rotated. This is however rarely utilized, and in-plane rotations prove challenging for many trackers. The DCF-based trackers contain an assumption that targets do not rotate, and other trackers such as LOT [39] view all changes in appearance as noise.

Low Resolution: Another challenging attribute is low resolution. Low resolution reduces the available information about the target, which may reduce the ability to discriminate between two targets. It is however common, both due to cheap cameras and due to large distances to targets.

Motion Blur: Motion blur occurs when a target moves quickly, drastically changing the target appearance into a smudged mess.

Occlusions: Targets are often covered by something else. This can be seen in sequences showing a person walking behind a car, or a tiger moving behind some vegetation. This attribute is referred to as occlusion, and it may be partial, in the sense that only a part of the target is occluded, or full.

Out-of-Plane Rotations: Contrary to in-plane rotations, out-of-plane rotations are not in the image plane; that is, the rotation vector is not perpendicular to the image plane. This results in some parts of the target disappearing, and others appearing.

Out-of-View: Videos may show a target moving beyond the image border, disappearing for a set of frames, displaying the out-of-view attribute. Many trackers become unstable when the target disappears and move around the video looking for the target, but are unable to locate it when it reappears.

Scale Variations: Often the target moves closer to or further away from the camera, changing in scale. This problem is commonly remedied by an attempt to estimate the scale and then rescaling the input image.

    2.2 Features

For many tasks in computer vision special kinds of features are used. These are elements which are extracted from the image. A feature is a vector calculated at a point in the input image using the local region. A feature map consists of these features sampled in a grid. Therefore feature extraction can be seen as a map from the input image space to R^(a×b×c), where a and b are the number of pixels in each row and column respectively, and c is the number of feature dimensions for that feature. Hence each image is transformed into another image, possibly with another number of channels, and possibly with another resolution.

Figure 2.1: Five sequences where a state-of-the-art tracker attempts to estimate the target trajectory. The first column contains an early frame of each shown sequence. The first row (from above) shows four frames of a scene from the movie "The Matrix"; it displays illumination variation and background clutter. The second row shows a "Transformer", an example of a deformable target. The third row consists of four frames depicting "Ironman", displaying the attributes fast motion and motion blur; the second and third frames are subsequent. The fourth row shows a low-resolution diver performing in-plane rotations. The fifth row shows a sequence containing scale variations, occlusions, and out-of-plane rotations.

There are various popular choices of features. Some examples are color names [46], Histogram of Oriented Gradients (HOG) [9], and features extracted from deep convolutional neural networks (CNN). In this thesis, color names and features extracted from a CNN are used.

  • 2.2 Features 11

    2.2.1 Color Names

The idea of color names [46] is to yield a linguistic description of the color at a given position in the image. At each position, that is, at each point in the feature map, there is an 11-element feature vector. Each element of the vector corresponds to a color and is the probability that this given point in the image would be referred to as that color. The small size of this feature and its discriminative power [11] make it attractive for visual tracking.

    2.2.2 CNN Features

During recent years convolutional neural networks (CNN) have shown remarkable results in several computer vision research areas. A CNN is a sequence of operations where the type of operation is specified in advance, but its parameters are trained on large datasets. The sequence of operation types is usually referred to as the network architecture. The architecture is divided into layers where each layer performs one operation type. The most common layers follow:

Convolution Layer - Contains several convolution filters. Each filter is applied to the input, one image per feature channel, and results in one image per convolution filter. The size of the convolution filters can vary between layers, but is usually fairly small, such as 3 × 3.

Max Pooling Layer - Usually yields a downsampled output. For each pixel in the downsampled image, a neighbourhood centered at the corresponding point in the input image is considered, and the value of the pixel is picked as the maximum in that neighbourhood. The size of the neighbourhood is typically 2 × 2, and the downsampling factor is typically 2. The operation is illustrated in Fig. 2.2.

Flatten Layer - Reshapes the input into a vector, preserving the elements. The input is usually a set of images, viewed as a third order tensor, which has been passed through a sequence of convolution and max pooling layers.

Fully Connected Layer - Can be viewed as a general linear mapping given by a matrix.

Non-linear Activation Function - Each convolutional and fully connected layer is followed by a non-linear activation function such as the commonly used rectified linear unit (ReLU), defined as f(x) = max(0, x). The activation function is applied element-wise to the input. For some applications, such as image classification, the desired output is a set of probabilities, where each element corresponds to a class and is viewed as the probability that the classified image belongs to that class. This is obtained with the softmax function, defined as $f(x_j) = e^{x_j} / \sum_k e^{x_k}$.
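As a concrete reference, here is a minimal numpy sketch of the two activation functions just described; the max-subtraction in the softmax is a standard numerical-stability trick, not part of the definition above:

```python
import numpy as np

def relu(x):
    # Rectified linear unit, applied element-wise: f(x) = max(0, x).
    return np.maximum(0.0, x)

def softmax(x):
    # Softmax over a vector of class scores; subtracting the maximum
    # does not change the result but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)
```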

Each convolution layer contains a set of convolution filters. Each filter coefficient is a parameter in the network. In a similar fashion, each fully connected layer contains a matrix, which represents the linear mapping. Each element of this matrix is a parameter in the network. These parameters are learnt on a dataset via some optimization scheme.


Input:        Output:
3 1 1 1       9 3
9 4 3 3       7 5
2 3 5 4
2 7 3 1

Figure 2.2: The operation of max pooling on a 4 × 4 input image (left), using a 2 × 2 pooling window with a downsampling factor of 2. The output image is to the right.
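A minimal numpy sketch of the pooling operation, reproducing the example of Figure 2.2 as reconstructed above (2 × 2 windows, downsampling factor 2):

```python
import numpy as np

def max_pool(x):
    # 2x2 max pooling with stride 2: split x into 2x2 blocks and take
    # the maximum of each block.
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[3, 1, 1, 1],
              [9, 4, 3, 3],
              [2, 3, 5, 4],
              [2, 7, 3, 1]])
print(max_pool(x))   # [[9 3]
                     #  [7 5]]
```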

As an example, in image classification a batch of several images is fed into the network, each labeled with a correct class. By calculating the gradient of the class prediction error with respect to the parameters, the parameters can be updated by, for instance, gradient descent. In practice a modified gradient descent is usually used. Figure 2.3 shows the VGG16 [43] architecture, which is used for image classification. It is trained on ImageNet [15], which contains over a million images.

For visual tracking, the input image can be fed into a convolutional neural network already trained for some other computer vision task. The outputs after different convolutional layers are images with a varying number of feature channels. Usually the resolution decreases while the number of feature channels increases as the image is propagated deeper through the network. Recent state-of-the-art trackers have shown that using the output of a convolutional layer as a feature map yields great gains in performance [35][26].
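As an illustration, the following sketch extracts such feature maps with PyTorch, assuming a recent torchvision with a pretrained VGG16; the layer index is a hypothetical choice, and in practice the image must be normalized the same way as during ImageNet training:

```python
import torch
import torchvision.models as models

# Pretrained VGG16; .features holds the convolutional part of the network.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def extract(image, layer=4):
    # image: (3, H, W) float tensor, ImageNet-normalized.
    # Returns the output of the chosen convolutional layer: a feature map
    # with one channel per convolution filter in that layer.
    x = image.unsqueeze(0)
    with torch.no_grad():
        for i, op in enumerate(vgg):
            x = op(x)
            if i == layer:
                return x.squeeze(0)
```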

    2.3 Prior Work in Visual Tracking

Visual object tracking receives much attention from the research community, and the approaches used in the state-of-the-art methods vary greatly. The research regarding the closely related problem of video segmentation is important to consider, as any solution to that problem also yields a solution to visual object tracking.

    2.3.1 Current State-of-the-Art

Currently, the top performing trackers use a wide and varied set of techniques. The MLDF and SSAT [36] trackers train a CNN online to predict the target position in each frame, using features from another network trained for image classification. TCNN [38] locates the target by averaging predictions made by multiple CNNs. Staple [4] merges information provided by a discriminative correlation filter and a color histogram. C-COT [13] is a continuous correlation filter using features extracted from a CNN trained for image classification, adding a spatial regularization to the filter learning. These five trackers were the top performers in the recent Visual Object Tracking (VOT) challenge 2016 [27]. In this thesis the latter will be used as a foundation, adding deformability to the correlation filter.

Figure 2.3: An example CNN architecture: the VGG16 image classification network, with operation types on the left and intermediate data sizes on the right. A 224 × 224 RGB image is used as input; as RGB images have 3 channels, it is viewed as an array of size 224 × 224 × 3. Two 3 × 3 convolutions with ReLU followed by a 2 × 2 max pool (stride 2) give size 112 × 112 × 64; two more convolutions and a max pool give 56 × 56 × 128; three convolutions and a max pool give 28 × 28 × 256; three convolutions and a max pool give 14 × 14 × 512; three convolutions and a max pool give 7 × 7 × 512. A flatten layer, two fully connected layers with ReLU, and a fully connected layer with softmax produce the output vector of length 1000.

    2.3.2 Relation to Video Segmentation

Furthermore, a natural extension to the problem of estimating bounding boxes is to estimate a full object segmentation in each frame. With a segmentation, a bounding box is easily estimated, and the segmentation in itself can be useful for some applications. There are several techniques for video segmentation. NLC [16], JOTS [47] and ObjectFlow [44] are graph-based techniques. FusionSeg [23] consists of two CNNs, one receiving an RGB image as input, and the other taking a precalculated optical flow as input. MaskTrack [24] trains a CNN which predicts an object mask in each frame. Their CNN receives both an RGB image and a segmentation mask from the previous frame.

These methods are however usually very slow, often requiring several seconds or minutes per frame. In visual tracking, real-time efficiency is not always attained, but often desired; too slow methods are deemed infeasible. Furthermore, the datasets used for evaluation in this field differ from the challenging datasets seen in visual tracking. Commonly used are SegTrack-v2 [28], DAVIS [40] and YoutubeObjects [6][41]. The former two are fairly small datasets, both in terms of number of sequences and in terms of sequence length. The latter is large, but comprises only ten classes. These datasets also often contain situations where the target is larger than the frame, which is generally not the case in visual tracking. The scope of this thesis is limited, and little attention is given to video segmentation. Instead we focus only on visual tracking, which is closely related.

    2.3.3 Deep Learning for Visual Tracking

Recently many computer vision problems have seen a surge of approaches based on deep learning. Recent advances in the field and the massive processing power of graphics cards have led to the possibility of training complex models with big data. Image classification tasks can be solved by a convolutional neural network (CNN) trained on millions of images. The research community has shown much interest in whether neural networks can outperform the current state-of-the-art trackers, and how this could be done. The current state-of-the-art trains a CNN online to classify the target as foreground and anything else as background. Problems such as occlusions and background clutter will however force some additional component to be incorporated into such a tracker. To solve this, a time component would need to be incorporated into the model. One intuitive idea would be to investigate recurrent neural networks (RNN). A common question is whether the tracking problem can be solved end-to-end by training a neural network.

    2.3.4 Part-based Approaches

A recurrent idea in visual tracking is to divide the target into parts and have several small trackers track each part. Tracking a smaller part of the image deteriorates performance however, as there are fewer good features to utilize. This can easily be seen by considering, for instance, the problem of tracking a leg compared to the problem of tracking an entire body in a video showing a crowd. Additionally, partial occlusions may become full occlusions, the parts may lack discriminative features to track, and the parts may move very quickly relative to their size. To reduce the effects of these problems, information about the different parts needs to be shared, for instance by constraining their movements relative to each other with a system of springs.

The work of [31] utilizes several kernelized correlation filters (KCF) [22], each tracking a part of the target. The filters are combined in a probabilistic framework by considering the part filter locations as an observation of the target state. By weighting the part filter responses depending on the peak-to-sidelobe ratio (PSR), and disabling part filter updates whenever the PSR is too low, they gain robustness against partial occlusions. Li et al. utilize several KCF where a particle filter is used to merge information. The deformable parts tracker (DPT) [33] utilizes several KCF constrained by a system of springs. The predicted part positions are merged with the spring model, combining target appearance with a geometric regularization. These methods all build on the KCF, which is a highly computationally efficient filter with good tracking performance. Additional components are added to merge the responses provided by the filters.

    2.4 Discriminative Correlation Filters

Trackers based on discriminative correlation filters have recently achieved great success in several object tracking benchmarks [27]. These trackers rely on learning a set of 2-dimensional correlation filters in the first frame. The tracker will then predict a bounding box in subsequent frames, and usually also update the filters as new boxes are predicted.

    2.4.1 Simple Correlation Filter

Arguably, the simplest correlation filter is a single matrix [5], denoted f[k1, k2]. Each video frame, a new sample x is received. In this case, the sample corresponds to a two-dimensional array corresponding to a part of a grayscale image. Denote the sample size as N1 × N2; f is of the same type and size, that is, an N1 × N2 array. The filter f is trained on the hitherto received samples. Denote this set of samples {x^1, x^2, ..., x^C}. Here, we note that training typically occurs every m frames during the whole sequence. The set of samples can be all hitherto received samples, or an intelligently selected subset.

    Define a detection score operator

$$S_f\{x\}[k_1, k_2] = (f \star x)[k_1, k_2], \tag{2.1}$$

where ⋆ denotes circular cross-correlation. We desire the score operator to produce a sharp peak at the target location in the frame corresponding to the sample x. An example of this can be seen in Fig. 2.4. For efficiency the score is calculated in the Fourier domain, using the Discrete Fourier Transform (DFT),

$$S_f\{x\}[k_1, k_2] = \mathrm{DFT}^{-1}\left\{ \overline{\mathrm{DFT}\{f\}} \cdot \mathrm{DFT}\{x\} \right\}[k_1, k_2], \tag{2.2}$$

where the overline denotes the complex conjugate operation. The detection score is used to estimate the target position by locating its maximum. As the parameter space is very small, a simple grid search can do this. Via interpolation, sub-pixel accuracy can be achieved; that is, rather than finding the maximum of S{x}[k1, k2] one finds the maximum of an interpolated continuous version S{x}(t1, t2). This is done for instance by Newton's method, where, by performing all calculations in the Fourier domain, an implicit interpolation is gained.

Figure 2.4: The score operator S_f applied to the sample, overlaid on the input image (left). The label function y overlaid on the input image (right). S_f was trained to track the diver in the initial frame.

Hitherto a filter f has been used to detect the target position. Before doing so, f must be trained. This can be done by training the score operator applied to the sample x^c corresponding to frame c, S_f{x^c}, to yield a Gaussian function with a peak centered at the target location. That is, define the Gaussian label

$$y[k_1, k_2] = \frac{1}{\sqrt{2\sigma^2}}\, e^{-\frac{k_1^2 + k_2^2}{2\sigma^2}} \tag{2.3}$$

    and shifted versions

$$y^c[k_1, k_2] = \tau_{p^c} y[k_1, k_2] = y[k_1 - p_1^c,\, k_2 - p_2^c], \tag{2.4}$$

where τ_{p^c} is a translation operator which centers the peak at the target location p^c = (p_1^c, p_2^c) in frame c. To train the tracker we find the filter f which minimizes the objective functional

$$\varepsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| S_f\{x^c\} - y^c \right\|_2^2 = \sum_{c=1}^{C} \alpha^c \left\| (f \star x^c) - y^c \right\|_2^2 \tag{2.5}$$

where α^c is a weight applied to each frame, and ‖·‖₂ is the Euclidean norm. For computational efficiency we apply Parseval's theorem to get

$$\varepsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \overline{\hat{f}} \cdot \hat{x}^c - \hat{y}^c \right\|_2^2, \tag{2.6}$$

which is a linear least squares problem with a closed form solution. Here the hat denotes the Discrete Fourier Transform (DFT) of a finite discrete signal.

What has been described is a simple correlation filter which can be used as a visual tracker. There are two parts to this. When the initial sample is received, the target location (p_1^1, p_2^1) is known. The filter coefficients f are then calculated by minimizing (2.6). In the subsequent frame, a new sample is received. The estimated filter is then used to estimate (p_1^2, p_2^2), followed by an estimation of f. This is repeated, and yields a visual tracker which may work well for simple cases. However, most real sequences display one or several of the attributes described in Sec. 2.1.2, and this will cause the tracker to fail.

    2.4.2 Closed Form Solution

The objective (2.6) is minimized using linear least squares. A similar derivation is performed by Bolme et al. [5]. To find a stationary point of the objective, the derivative is taken with respect to each of the filter coefficients and set to zero. Note that the objective is real, and the derivative is taken with respect to both the real part and the imaginary part. Explicitly writing out the norm and changing the order of summation in (2.6) yields

$$\varepsilon(f) = \sum_{k_1, k_2} \sum_{c=1}^{C} \alpha^c \left| \overline{\hat{f}[k_1, k_2]}\, \hat{x}^c[k_1, k_2] - \hat{y}^c[k_1, k_2] \right|^2 \tag{2.7}$$

For a given k1, k2 the term

$$\sum_{c=1}^{C} \alpha^c \left| \overline{\hat{f}[k_1, k_2]}\, \hat{x}^c[k_1, k_2] - \hat{y}^c[k_1, k_2] \right|^2 \tag{2.8}$$

depends only on f̂[k1, k2] and no other filter coefficient. As shown in A.1, it is minimized if and only if f̂[k1, k2] satisfies

$$0 = \sum_{c=1}^{C} \alpha^c \left( |\hat{x}^c[k_1, k_2]|^2 \hat{f}[k_1, k_2] - \hat{x}^c[k_1, k_2]\, \overline{\hat{y}^c[k_1, k_2]} \right). \tag{2.9}$$

    Utilizing this, each optimal filter coefficient is found as

$$\hat{f}[k_1, k_2] = \frac{\sum_{c=1}^{C} \alpha^c\, \hat{x}^c[k_1, k_2]\, \overline{\hat{y}^c[k_1, k_2]}}{\sum_{c=1}^{C} \alpha^c\, \hat{x}^c[k_1, k_2]\, \overline{\hat{x}^c[k_1, k_2]}}. \tag{2.10}$$

This closed form solution is available for this simple case. More intricate approaches to the visual tracking problem will require iterative optimization methods.
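A minimal numpy sketch of this training and detection loop for the single-channel case; the small constant added to the denominator is a hypothetical regularizer to avoid division by zero, not part of (2.10) itself:

```python
import numpy as np

def train_filter(xs, ys, alphas, eps=1e-4):
    # Closed-form solution (2.10): one independent least-squares problem
    # per DFT frequency. xs, ys are lists of real N1 x N2 samples/labels,
    # alphas the corresponding frame weights.
    num, den = 0.0, eps
    for x, y, a in zip(xs, ys, alphas):
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        num = num + a * X * np.conj(Y)
        den = den + a * X * np.conj(X)
    return num / den          # \hat f, the filter in the Fourier domain

def detect(f_hat, x):
    # Detection score (2.2): inverse DFT of conj(\hat f) * \hat x; its
    # maximum is the estimated target position.
    return np.real(np.fft.ifft2(np.conj(f_hat) * np.fft.fft2(x)))
```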

    2.4.3 Equivalence of Convolution- and Correlation Operators

For mathematical convenience it is useful to use convolution operators rather than correlation operators. We shall show that the problem solved, and the result, are equivalent when exchanging the correlation operators for the more convenient convolution operators. Consider u, v ∈ L²_T(R). Since

$$(u \star_T v)(t) = \frac{1}{T} \int_0^T \overline{u(\tau)}\, v(t + \tau)\, d\tau \tag{2.11}$$


    and

$$(u *_T v)(t) = \frac{1}{T} \int_0^T u(\tau)\, v(t - \tau)\, d\tau \tag{2.12}$$

    we have

$$(u \star_T v)(t) = \frac{1}{T} \int_0^T \overline{\check{u}(-\tau)}\, v(t + \tau)\, d\tau = \frac{1}{T} \int_0^T \overline{\check{u}(\tau)}\, v(t - \tau)\, d\tau = (\overline{\check{u}} *_T v)(t) \tag{2.13}$$

where the check denotes the mirroring operation. This means that a convolution filter is just the mirror image of an equivalent correlation filter. Instead, consider u, v : Z → C such that (u ∗ v)[k] is defined, and where v is periodic with period K. This yields

$$(u \star v)[k] = \sum_{l=0}^{K-1} \overline{\check{u}[-l]}\, v[k+l] = \sum_{l=-(K-1)}^{0} \overline{\check{u}[l]}\, v[k-l] = \sum_{l=0}^{K-1} \overline{\check{u}[l]}\, v[k-l] = (\overline{\check{u}} * v)[k]. \tag{2.14}$$

Again, this means that by mirroring the first operand of the convolution, correlation is achieved. This shall be used to instead train a convolution filter. Note that this also applies in two dimensions, that is, when u, v ∈ L²(R²), or when u, v : Z² → C are such that (u ∗ v)[k1, k2] is defined and v is periodic with periods K1, K2.
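The discrete identity (2.14) is easy to check numerically. A small sketch with real-valued signals (so the conjugations are vacuous) follows, where the mirrored filter is built using periodic indexing:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
u, v = rng.standard_normal(K), rng.standard_normal(K)

# Circular cross-correlation: (u ? v)[k] = sum_l u[l] * v[(k + l) mod K].
corr = np.array([sum(u[l] * v[(k + l) % K] for l in range(K))
                 for k in range(K)])

# Mirrored filter under periodic indexing: u_mirror[l] = u[(-l) mod K].
u_mirror = np.roll(u[::-1], 1)

# Circular convolution of the mirrored filter with v gives the same result.
conv = np.array([sum(u_mirror[l] * v[(k - l) % K] for l in range(K))
                 for k in range(K)])
assert np.allclose(corr, conv)
```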

    2.4.4 Multiple Features

The convolution filter can be improved by allowing multiple feature channels. Consider the sample x to be a tensor of order 3, of size N1 × N2 × D. We denote x = (x_1, x_2, ..., x_D). This sample can for instance be a part of an RGB image with D = 3. In practice it will correspond to some other feature map extracted from the image, such as HOG, color names, or features extracted by propagating the image through the initial layers of a convolutional neural network. We generalize the detection score operator to the multiple feature case as

$$S_f\{x\}[k_1, k_2] = \sum_{d=1}^{D} (f_d * x_d)[k_1, k_2]. \tag{2.15}$$

The individual feature responses in each feature dimension are illustrated in Fig. 2.5. The filter optimization is again performed in the Fourier domain for reasons of efficiency. The DFT of the detection score is found as

$$\hat{S}_f\{x\}[k_1, k_2] = \sum_{d=1}^{D} \hat{f}_d[k_1, k_2]\, \hat{x}_d[k_1, k_2] \tag{2.16}$$


Figure 2.5: The detection score S_f is found by summing the filter responses (f_d ∗ x_d) over each dimension d. Here the filter responses in four dimensions are shown, overlaid on the input image. In this case, the feature maps x_d are extracted from the first convolutional layer of a classification CNN.

    yielding the objective

$$\varepsilon(f) = \sum_{c=1}^{C} \alpha^c \left\| \hat{S}_f\{x^c\} - \hat{y}^c \right\|_2^2 = \sum_{c=1}^{C} \alpha^c \left\| \left( \sum_{d=1}^{D} \hat{f}_d\, \hat{x}_d^c \right) - \hat{y}^c \right\|_2^2. \tag{2.17}$$

The current state-of-the-art visual trackers show that it is necessary to use some kind of multi-dimensional features which are robust to changes in appearance. While this improves tracking performance, the training process becomes more complex, and there is no closed-form solution corresponding to (2.10) available. Instead some other optimization method is utilized. In the learning framework proposed with C-COT, the method of least squares is utilized together with a solver for linear systems of equations.
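A minimal numpy sketch of the multi-channel detection score (2.15)-(2.16), computed in the Fourier domain; x and f_hat are assumed to be N1 × N2 × D arrays, with the filter already in convolution form per Sec. 2.4.3:

```python
import numpy as np

def score(f_hat, x):
    # Eq. (2.16): per-channel element-wise products in the Fourier
    # domain, summed over the feature dimension d.
    X = np.fft.fft2(x, axes=(0, 1))
    S_hat = np.sum(f_hat * X, axis=2)
    return np.real(np.fft.ifft2(S_hat))   # the detection score S_f{x}
```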

    2.4.5 Detection

When receiving a new frame, indexed c, the trained filter should be applied to that frame to predict the target position. This is followed by a filter update, which requires labels for all hitherto received frames. The new label ŷ^c is created by centering a Gaussian function at the estimated target position. Utilizing the detection scores, we find the target position as

$$\underset{k_1, k_2}{\arg\max}\; S_f\{x^c\}[k_1, k_2] \tag{2.18}$$

by performing a grid search over k1, k2. The solution is then refined with sub-pixel accuracy by applying a few steps of Newton's method. To do this, define a continuous score operator by performing the calculations in the Fourier domain, yielding an implicit interpolation,

$$S_f(t_1, t_2) = \frac{1}{N_1 N_2} \sum_{k_1, k_2} \hat{S}\{x\}[k_1, k_2]\, e^{i 2\pi \left( \frac{k_1}{N_1} t_1 + \frac{k_2}{N_2} t_2 \right)}. \tag{2.19}$$

    We find the gradient

$$\nabla S_f(t_1, t_2) = \frac{i 2\pi}{N_1 N_2} \sum_{k_1, k_2} \hat{S}\{x\}[k_1, k_2]\, e^{i 2\pi \left( \frac{k_1}{N_1} t_1 + \frac{k_2}{N_2} t_2 \right)} \begin{pmatrix} k_1 / N_1 \\ k_2 / N_2 \end{pmatrix}, \tag{2.20}$$

the Hessian

$$\nabla^2 S_f(t_1, t_2) = \frac{(i 2\pi)^2}{N_1 N_2} \sum_{k_1, k_2} \hat{S}\{x\}[k_1, k_2]\, e^{i 2\pi \left( \frac{k_1}{N_1} t_1 + \frac{k_2}{N_2} t_2 \right)} \begin{pmatrix} \frac{k_1^2}{N_1^2} & \frac{k_1 k_2}{N_1 N_2} \\ \frac{k_1 k_2}{N_1 N_2} & \frac{k_2^2}{N_2^2} \end{pmatrix}, \tag{2.21}$$

    and can now use Newton’s method to refine the coarsely estimated optimum found withthe grid search.

    2.4.6 Scale Estimation

The tracked target may move towards or away from the camera, resulting in a change of scale. In such a scenario, the simple DCF may lose track of its target as its rigid template no longer matches. Even if it were to successfully track its target, for instance due to robustness provided by the features, the resulting bounding box would be of the wrong size. Both of these problems can be solved by estimating the target scale and then rescaling the given sample [10].

During the detection stage, where the target location is estimated, we also estimate the target scale. This is done by resampling the given image at different scales. Feature maps are extracted, yielding a sample for each scale. The score operator S_f is then applied to each sample. For each corresponding detection score, the target location is estimated as described in Sec. 2.4.5. The sample with the highest detection score is utilized, and the corresponding scale is used as an estimate of the target scale. The following filter optimizations will use this sample. By doing this, the target will remain of the same size in all samples. The resampled images and their corresponding detection scores are shown in Fig. 2.6.


Figure 2.6: The upper row shows the input image sampled at three different scales. The lower row shows the detection scores S_f{x} for each scale. The rightmost image yields the maximum score, and its scale is used as the estimate.
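A rough sketch of the scale search, reusing score from the multi-channel example above; extract_patch is a hypothetical callback that crops a scale-dependent region around the current target position, and scipy's zoom resamples it back to the filter support:

```python
import numpy as np
from scipy.ndimage import zoom

def best_scale(f_hat, extract_patch, scales=(0.98, 1.0, 1.02)):
    peaks = []
    for s in scales:
        patch = extract_patch(s)             # (H, W, D) crop at scale s
        factors = (f_hat.shape[0] / patch.shape[0],
                   f_hat.shape[1] / patch.shape[1], 1)
        x = zoom(patch, factors, order=1)    # resample to the filter size
        peaks.append(score(f_hat, x).max())  # peak of the detection score
    return scales[int(np.argmax(peaks))]     # scale with the highest peak
```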

    2.4.7 Windowing of Features

There is prior information about the target movement. Positioning systems often model this prior by assuming a motion model and then fusing this information with the measurements, which in our case would be the estimated target positions. When working with generic tracking, the motion model may however be unreliable for two reasons: we are tracking generic targets which may move very differently, and the camera itself may also move.

Another issue is border effects. Since computations are performed in the Fourier domain, there is an implicit periodic extension of the video frame. This means that close to the frame borders, the scores will display artifacts due to this periodic assumption. In signal processing this effect is commonly addressed by applying a window to the signal, which removes energy from samples near the borders. Primarily this removes border effects, but it also results in a prior that the target will not move too far between two consecutive frames.
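A one-line version of this windowing, assuming a Hann window (a common choice in signal processing; the text above does not commit to a specific window):

```python
import numpy as np

def window_features(x):
    # Multiply each feature channel of x (N1 x N2 x D) by a 2-D Hann
    # window, suppressing energy near the patch borders.
    w = np.outer(np.hanning(x.shape[0]), np.hanning(x.shape[1]))
    return x * w[:, :, None]
```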

    2.4.8 Spatial Regularization

    Furthermore the tracker may learn parts of the background as parts of the target. In thecase of discriminative correlation filters, high weights may be assigned to filter coeffi-cients corresponding to background. To alleviate this, the DCF-based tracker SRDCF[12] introduced a spatial regularization. In their learning formulation, a cost was added toeach coefficient which increased with the distance from the filter center. Another perk to


Figure 2.7: To the left is an input image. In the middle is one dimension of a feature map extracted from a convolutional neural network. To the right is the windowed feature map.

Figure 2.8: An example of the filter in one dimension, $f_d$, overlaid on the input image. Here the filter was trained on the region containing the coke.

Alter (2.17) by adding a regularization term

$$\epsilon_s = \sum_{d=1}^{D} \| w \cdot f_d \|^2 \quad (2.22)$$

to the objective, which then becomes

$$\epsilon(f) = \sum_{c=1}^{C} \alpha_c \left\| \hat{S}_f\{x^c\} - \hat{y}^c \right\|_2^2 + \sum_{d=1}^{D} \left\| \hat{w} * \hat{f}_d \right\|_2^2. \quad (2.23)$$

Here w is a window with high values close to the borders, heavily regularizing filter coefficients placed there. Note how $\epsilon_s$ becomes a convolution when expressed in the Fourier domain. It is desirable to represent the filter coefficients only in the Fourier domain and not be forced to switch domains. Hence, for computational efficiency it is important that the spatial regularization window can be represented with few coefficients.
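The following sketch illustrates this point numerically for a quadratic, SRDCF-style window in one dimension; the constants are illustrative only and not taken from the thesis.

```python
import numpy as np

# A quadratic window (small at the center, large at the borders) is well
# captured by a handful of Fourier coefficients, keeping the convolution
# with w-hat in eq. (2.23) cheap. Constants below are made up.
T, N = 1.0, 64
t = np.linspace(0.0, T, N, endpoint=False)
w = 0.1 + 3.0 * (2.0 * t / T - 1.0) ** 2

w_hat = np.fft.fft(w)
K = 5                                        # keep frequencies |k| <= 5
w_hat_trunc = np.zeros_like(w_hat)
w_hat_trunc[:K + 1], w_hat_trunc[-K:] = w_hat[:K + 1], w_hat[-K:]
w_approx = np.real(np.fft.ifft(w_hat_trunc))

rel_err = np.linalg.norm(w - w_approx) / np.linalg.norm(w)
print(f"relative error with {2 * K + 1} of {N} coefficients: {rel_err:.3e}")
```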


    2.5 Continuous Formulation

A key strategy developed by Danelljan et al. [13] is to define the filter as a set of continuous periodic functions, one for each feature dimension. These are applied to the samples by defining an interpolation function which transforms them to the continuous domain. Previously, a major challenge was how to merge information obtained from feature maps of different resolutions. Powerful discriminative features are often of different scales, as is the case for features extracted from different layers of a convolutional neural network. The shallow layers contain high-resolution low-level information such as edges, and the deeper layers contain low-resolution high-level information. The shallow layers have been shown to contain information which is very suitable for tracking, while features extracted from the deeper layers can be very robust to large changes in target appearance.

    2.5.1 Definitions

    Several new definitions as well as some redefinitions are needed,

Description — Notation

Cont. variables — $t = (t_1, t_2) \in \mathbb{R}^2$
Disc. variables — $n = (n_1, n_2) \in \mathbb{Z}^2$
Four. domain variables — $k = (k_1, k_2) \in \mathbb{Z}^2$
Num. disc. variables — $N^d = (N_1^d, N_2^d) \in \mathbb{N}^2$
Four. coefficients limit — $K^d = (K_1^d, K_2^d) \in \mathbb{Z}^2$
Support region — $T = (T_1, T_2) \in \mathbb{R}^2$
Samples index — $c = 1, \ldots, C$
Feat. dim. index — $d = 1, \ldots, D$
Filter — $f = (f_1, f_2, \ldots, f_D) \in L_T^2(\mathbb{R}^2)^D$
Sample space — $\mathcal{X} = (\mathcal{X}_1, \ldots, \mathcal{X}_D) = (\mathbb{R}^{N_1^1 N_2^1}, \ldots, \mathbb{R}^{N_1^D N_2^D})$
Sample — $x_d^c \in \mathcal{X}_d$
Spat. reg. — $w \in L_T^2(\mathbb{R}^2)$
Interpolation function — $b_d \in L_T^2(\mathbb{R}^2)$
Interpolation operator — $J_d : \mathcal{X}_d \to L_T^2(\mathbb{R}^2)$
Score operator — $S_f : \mathcal{X} \to L_T^2(\mathbb{R}^2)$
Label — $y^c \in L_T^2(\mathbb{R}^2)$
Sample weight — $\alpha_c \in [0, 1)$
Filter position — $q^c \in \mathbb{R}^2$


    2.5.2 Filter Application

To be able to apply the continuous filter to the discrete samples, an interpolation operator $J_d$ is applied to the samples,

$$J_d\{x_d\}(t_1, t_2) = \sum_{n_1=0}^{N_1^d - 1} \sum_{n_2=0}^{N_2^d - 1} x_d[n_1, n_2]\, b_d\!\left( t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2 \right). \quad (2.24)$$

Here $x_d^c$ is the sample taken at time instance c, along feature dimension d, and $b_d$ is the interpolation function. $N^d = (N_1^d, N_2^d)$ is the number of points of that sample along both spatial dimensions. $T = (T_1, T_2)$ is the arbitrary but fixed period of the interpolated samples and the filters. $t = (t_1, t_2)$ is the continuous spatial variable and $n = (n_1, n_2)$ the discrete spatial variable. The detection score operator is redefined as

$$S_f\{x\} = \sum_{d=1}^{D} f_d * J_d\{x_d\} \quad (2.25)$$

where $f_d \in L_T^2(\mathbb{R}^2)$ is the filter and D the number of feature dimensions. This results in a continuous score for any given input sample, which will be used to localize the target.

    2.5.3 Objective Functional

The objective is redefined as

$$\epsilon(f) = \sum_{c=1}^{C} \alpha_c \| S_f\{x^c\} - y^c \|^2 + \sum_{d=1}^{D} \| w \cdot f_d \|^2 \quad (2.26)$$

where $w \in L_T^2(\mathbb{R}^2)$ is the spatial regularization, $\alpha_c \in [0, 1)$ is the weight of the sample taken at time instance c, $y^c \in L_T^2(\mathbb{R}^2)$ is the label function, for instance a Gaussian centered about the target, and $f_d \in L_T^2(\mathbb{R}^2)$ is the convolution filter in dimension d.

Given a set of samples and labels, the functional shall be minimized. As previously, this is done in the Fourier domain. This time, however, the Fourier coefficients are utilized rather than the DFT, which is necessary since a discrete representation is required. First we find the Fourier coefficients of the score function as

$$\hat{S}_f\{x^c\} = \sum_{d=1}^{D} \hat{f}_d \hat{J}_d\{x_d^c\} \quad (2.27)$$

where $\hat{f}_d$ and $\hat{J}_d\{x_d^c\}$ are the Fourier coefficients of $f_d$ and $J_d\{x_d^c\}$ respectively. The latter


is found as

$$\hat{J}_d\{x_d^c\}[k_1, k_2] = \mathcal{F}\left\{ \sum_{n_1=0}^{N_1^d - 1} \sum_{n_2=0}^{N_2^d - 1} x_d^c[n_1, n_2]\, b_d\!\left( t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2 \right) \right\}$$
$$= \sum_{n_1=0}^{N_1^d - 1} \sum_{n_2=0}^{N_2^d - 1} x_d^c[n_1, n_2]\, \mathcal{F}\left\{ b_d\!\left( t_1 - \frac{T_1}{N_1^d} n_1,\; t_2 - \frac{T_2}{N_2^d} n_2 \right) \right\}$$
$$= \hat{b}_d[k_1, k_2] \sum_{n_1=0}^{N_1^d - 1} \sum_{n_2=0}^{N_2^d - 1} x_d^c[n_1, n_2]\, e^{-i2\pi\left(k_1 n_1/N_1^d + k_2 n_2/N_2^d\right)} = \hat{b}_d[k_1, k_2]\, \mathrm{DFT}\{x_d^c\}[k_1, k_2]. \quad (2.28)$$
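In code, (2.28) amounts to an element-wise product between the kernel spectrum and the DFT of the feature channel. A minimal sketch, assuming the kernel coefficients $\hat{b}_d$ are precomputed on the same frequency grid:

```python
import numpy as np

def interp_sample_coeffs(x_d, b_hat):
    """Fourier coefficients of the interpolated sample J_d{x_d} (eq. 2.28):
    the DFT of the discrete feature channel, scaled elementwise by the
    Fourier coefficients of the interpolation kernel b_d."""
    return b_hat * np.fft.fft2(x_d)

# Usage: one 32x32 feature channel and a placeholder kernel spectrum.
x_d = np.random.randn(32, 32)
b_hat = np.ones((32, 32))       # identity kernel spectrum, for illustration
z_hat = interp_sample_coeffs(x_d, b_hat)
```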

The objective can now be written in the Fourier domain, using Parseval's formula,

$$\epsilon(f) = \sum_{c=1}^{C} \alpha_c \left\| \sum_{d=1}^{D} \mathrm{DFT}\{x_d^c\}\, \hat{b}_d \hat{f}_d - \hat{y}^c \right\|^2 + \sum_{d=1}^{D} \left\| \hat{w} * \hat{f}_d \right\|^2. \quad (2.29)$$

In practice the filter representation must be finite. The objective (2.29) is discrete and is truncated to fulfill this requirement.

    2.5.4 Filter Training

Equation (2.29) is minimized using the normal equations and the method of conjugate gradients. The first step is to vectorize the objective. The Fourier coefficients representing the filter are truncated, which leads to a finite approximation. This is intuitive, as the coefficients close to the center tend to contain the most energy. We use $2K^d + 1$ coefficients centered about 0, where $K^d = \lfloor \frac{N^d}{2} \rfloor$. Also define $K = \max_d K^d$. Define a $\sum_{d=1}^{D} (2K_1^d + 1)(2K_2^d + 1)$ sized vector

$$\hat{f} = \begin{pmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_D \end{pmatrix} \quad (2.30)$$

where

$$\hat{f}_d = \begin{pmatrix} \hat{f}_d[-K_1^d, -K_2^d] \\ \vdots \\ \hat{f}_d[-K_1^d, K_2^d] \\ \vdots \\ \hat{f}_d[K_1^d, K_2^d] \end{pmatrix}, \quad (2.31)$$


and a $C(2K_1 + 1)(2K_2 + 1) \times \sum_{d=1}^{D} (2K_1^d + 1)(2K_2^d + 1)$ sized matrix

$$A = \begin{pmatrix} A^1 \\ A^2 \\ \vdots \\ A^C \end{pmatrix} \quad (2.32)$$

where

$$A^c = \begin{pmatrix} A^{c,1} & A^{c,2} & \cdots & A^{c,D} \end{pmatrix}, \quad A^{c,d} = \mathrm{diag}\!\begin{pmatrix} \mathrm{DFT}\{x_d^c\}\hat{b}_d[-K_1^d, -K_2^d] \\ \vdots \\ \mathrm{DFT}\{x_d^c\}\hat{b}_d[-K_1^d, K_2^d] \\ \vdots \\ \mathrm{DFT}\{x_d^c\}\hat{b}_d[K_1^d, K_2^d] \end{pmatrix}. \quad (2.33)$$

Here, “diag” is an operator transforming a vector into the corresponding diagonal matrix. Furthermore, define a size $C(2K_1 + 1)(2K_2 + 1)$ vector

$$\hat{y} = \begin{pmatrix} \hat{y}^1 \\ \hat{y}^2 \\ \vdots \\ \hat{y}^C \end{pmatrix} \quad (2.34)$$

where

$$\hat{y}^c = \begin{pmatrix} \hat{y}^c[-K_1, -K_2] \\ \vdots \\ \hat{y}^c[-K_1, K_2] \\ \vdots \\ \hat{y}^c[K_1, K_2] \end{pmatrix}. \quad (2.35)$$

Also define

$$\Gamma = \begin{pmatrix} \alpha_1 I_{2K+1} & & & \\ & \alpha_2 I_{2K+1} & & \\ & & \ddots & \\ & & & \alpha_C I_{2K+1} \end{pmatrix}. \quad (2.36)$$

Lastly, the spatial regularization is rewritten as

$$\sum_{d=1}^{D} \hat{w} * \hat{f}_d = W \hat{f} \quad (2.37)$$

  • 2.5 Continuous Formulation 27

where W is the block-diagonal matrix in which each block is a convolution matrix containing the elements of $\hat{w}$ and corresponds to the convolution.

We can now rewrite the objective functional (2.29) as

$$\epsilon(f) = \sum_{c=1}^{C} \alpha_c \| A^c \hat{f} - \hat{y}^c \|_2^2 + \| W \hat{f} \|_2^2 = \| \sqrt{\Gamma}(A\hat{f} - \hat{y}) \|_2^2 + \| W \hat{f} \|_2^2 \quad (2.38)$$

which can be solved using the method of least squares. Here $\sqrt{\cdot}$ denotes the element-wise square root. As shown in Sec. A.2, the objective is minimized by solving the normal equations

$$(A^H \Gamma A + W^H W)\hat{f} = A^H \Gamma \hat{y}. \quad (2.39)$$

In practice this system may contain a number of equations on the order of $10^5$, which renders exact solvers such as Cholesky decomposition, or methods based on for instance Strassen's algorithm, infeasible. The Conjugate Gradient (CG) method [42] yields a solution with a time complexity which is linear in the number of non-zero elements in the system, depending on the condition number of the left-hand side. Therefore CG is used to efficiently solve the system.
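A sketch of this step, assuming A, W and the sample weights have been assembled as sparse matrices elsewhere; the left-hand side of (2.39) is only ever applied as matrix-vector products, which is what makes CG attractive here.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_filter(A, W, gamma_diag, y_hat, f0=None):
    """Solve the normal equations (2.39) with conjugate gradients.
    A, W: scipy sparse matrices; gamma_diag: per-row sample weights;
    the Hermitian PSD left-hand side is never formed explicitly."""
    def matvec(f):
        return A.conj().T @ (gamma_diag * (A @ f)) + W.conj().T @ (W @ f)

    n = A.shape[1]
    op = LinearOperator((n, n), matvec=matvec, dtype=complex)
    rhs = A.conj().T @ (gamma_diag * y_hat)
    f_hat, info = cg(op, rhs, x0=f0, maxiter=100)
    return f_hat
```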

    2.5.5 The Label Function

The appearance of the label function $y^c$ is a design choice. Between the different frames the function changes only in translation. It should produce a single sharp peak, resulting in easy localization and attenuation of false detections during training. No other conditions are used. For mathematical convenience we will look at a one-dimensional case. This one-dimensional label function $y^c$ can be extended to a two-dimensional version as $y^c(t_1, t_2) = y_1^c(t_1)\, y_2^c(t_2)$.

An approximate Gaussian function with period T is picked,

$$y^c(t) = \sum_{n=-\infty}^{\infty} z^c(t + nT) \quad (2.40)$$

where

$$z^c(t) = e^{-(t - q^c)^2 / 2\sigma^2}. \quad (2.41)$$

Here $q^c$ is the position of the target at time instance c. The Fourier transform is found as

$$\hat{z}^c(\omega) = \sigma\sqrt{2\pi}\, e^{-i\omega q^c} e^{-\omega^2 \sigma^2 / 2} \quad (2.42)$$

by using the Fourier transform for the case $q^c = 0$ (see Appendix A.4) and the translation property of the Fourier transform. By defining $y^c(t)$ as a periodic summation we can use Poisson's summation formula to find the Fourier coefficients


$$\hat{y}^c[k] = \frac{1}{T}\, \sigma\sqrt{2\pi}\, \exp\!\left( -\frac{2\pi^2\sigma^2}{T^2} k^2 - i2\pi \frac{q^c}{T} k \right). \quad (2.43)$$

A derivation is available in Appendix A.5.
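Equation (2.43) is cheap to evaluate directly. A small sketch, with hypothetical argument names:

```python
import numpy as np

def label_coeffs(q_c, sigma, T, K):
    """Fourier coefficients (eq. 2.43) of the periodically repeated
    Gaussian label centered at target position q_c, for k = -K..K."""
    k = np.arange(-K, K + 1)
    return (sigma * np.sqrt(2 * np.pi) / T
            * np.exp(-2 * (np.pi * sigma / T) ** 2 * k ** 2
                     - 1j * 2 * np.pi * q_c / T * k))
```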

    2.5.6 The Interpolation Function

The interpolation function $b_d(t)$, which transfers the samples to the continuous domain, is picked as the cubic spline

$$b(t) = \begin{cases} (a+2)|t|^3 - (a+3)t^2 + 1 & |t| \le 1 \\ a|t|^3 - 5at^2 + 8a|t| - 4a & 1 < |t| \le 2 \\ 0 & 2 < |t| \end{cases} \quad (2.44)$$

where a is a constant. We follow the original paper and use a = −0.75 in our experiments. The Fourier transform

$$\mathcal{F}\{b\}(\omega) = (6a + 12)\frac{1}{\omega^4} - 12\frac{1}{\omega^4}\cos(\omega) - 6a\frac{1}{\omega^4}\cos(2\omega) + 8a\frac{1}{\omega^3}\sin(2\omega) \quad (2.45)$$

is straightforward but messy to calculate. The details are found in Appendix A.6. The origin of the interpolation kernel is then rescaled and shifted half a sampling period, $T/(2N^d)$, to align it with the center of the feature maps,

$$\tilde{b}_d(t) = b\!\left( \frac{N^d}{T}\left( t - \frac{T}{2N^d} \right) \right) \quad (2.46)$$

which has the Fourier transform

$$\hat{\tilde{b}}_d(\omega) = \frac{T}{N^d}\, e^{-i\frac{T}{2N^d}\omega}\, \hat{b}\!\left( \frac{T}{N^d}\omega \right). \quad (2.47)$$

Finally we define the periodic interpolation kernel

$$b_d(t) = \sum_{n=-\infty}^{\infty} \tilde{b}_d(t - nT) \quad (2.48)$$

which, using the same reasoning as in (A.34), yields the Fourier coefficients

$$\hat{b}_d[k] = \frac{1}{T}\, \hat{\tilde{b}}_d\!\left( \frac{2\pi k}{T} \right) = \frac{1}{N^d}\, e^{-i2\pi k/(2N^d)}\, \hat{b}\!\left( \frac{2\pi k}{N^d} \right). \quad (2.49)$$

Here we have again reasoned for a one-dimensional case. A two-dimensional interpolation function can be defined as $\hat{b}_d[k_1, k_2] = \hat{b}_d[k_1]\, \hat{b}_d[k_2]$.
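For reference, the spatial kernel (2.44) evaluates as in the following sketch, with a = −0.75 as stated above.

```python
import numpy as np

def cubic_spline(t, a=-0.75):
    """Cubic interpolation kernel of eq. (2.44)."""
    t = np.abs(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    near = t <= 1
    mid = (t > 1) & (t <= 2)
    out[near] = (a + 2) * t[near] ** 3 - (a + 3) * t[near] ** 2 + 1
    out[mid] = a * t[mid] ** 3 - 5 * a * t[mid] ** 2 + 8 * a * t[mid] - 4 * a
    return out
```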


    2.5.7 Projection Estimation for Features

Many applications of visual tracking have a real-time constraint, and the described learning framework performs considerable work every frame. It was recently proposed to reduce the number of feature dimensions D by estimating a projection matrix [14] which can then be applied to the feature maps. The projection is estimated only in the first frame, where a single sample has been received. The motivation for this is two-fold. Intuitively, it should be good enough, as the same object will be tracked and the relevant features should remain about the same. Furthermore, doing this for every sample would increase the method complexity. Introducing this strategy into the previously described continuous formulation alters the initial filter training process. During subsequent frames, however, only the samples change as they are projected to a new, smaller sample space.

    Adding a Projection Matrix

We define this projection $P = \begin{pmatrix} p_1 & \cdots & p_D \end{pmatrix}$ as a $D' \times D$ matrix, where D is the number of projected feature dimensions and $D' > D$ is the number of feature dimensions of the input feature maps. Each $p_d$ is a vector used to project a sample onto feature dimension d. We define a new operator for the detection scores as

$$S_{Pf}\{x\} = \sum_{d=1}^{D} f_d * (P^T J\{x\})_d. \quad (2.50)$$

Also denote $z = J\{x\}$, with Fourier coefficients

$$\hat{z}_d[k_1, k_2] = \hat{b}_d[k_1, k_2]\, \mathrm{DFT}\{x_d\}[k_1, k_2], \quad (2.51)$$

and by treating the sample x and filter f as $D'$- and D-dimensional vectors of elements in $L_T^2(\mathbb{R}^2)$, we can write the Fourier coefficients of the score as

$$\hat{S}_{Pf}\{x\} = \sum_{d=1}^{D} \hat{f}_d\, (P^T \hat{z})_d = \hat{z}^T P \hat{f}. \quad (2.52)$$

Define a regularization term for this projection,

$$\epsilon_P(P) = \| P \|_F \quad (2.53)$$

where $\|\cdot\|_F$ is the Frobenius norm. Adding the regularization to the objective functional yields the new objective

$$\epsilon(f, P) = \| \hat{z}^T P \hat{f} - \hat{y} \|_{\ell^2}^2 + \sum_{d=1}^{D} \| \hat{w} * \hat{f}_d \|_{\ell^2}^2 + \lambda_P \| P \|_F^2. \quad (2.54)$$


Unlike the previously mentioned objectives, this one is non-linear. It is therefore optimized with the iterative Gauss-Newton algorithm. The estimated parameters in step i are denoted $(\hat{f}^i, P^i)$ and are updated as $\hat{f}^{i+1} = \hat{f}^i + \Delta\hat{f}$ and $P^{i+1} = P^i + \Delta P$. Each step is derived by using Taylor's theorem to get

$$\hat{z}^T (P^i + \Delta P)(\hat{f}^i + \Delta\hat{f}) \approx \hat{z}^T P^i (\hat{f}^i + \Delta\hat{f}) + \hat{z}^T \Delta P \hat{f}^i \quad (2.55)$$

which yields the linear least squares problem

$$\min_{(\Delta\hat{f}, \Delta P)} \left\| \hat{z}^T P^i (\hat{f}^i + \Delta\hat{f}) + \hat{z}^T \Delta P \hat{f}^i - \hat{y} \right\|_{\ell^2}^2 + \sum_{d=1}^{D} \left\| \hat{w} * (\hat{f}_d^i + \Delta\hat{f}_d) \right\|_{\ell^2}^2 + \lambda_P \| P^i + \Delta P \|_F^2. \quad (2.56)$$

By finding the normal equations, the Gauss-Newton step can be done efficiently with the conjugate gradients method.
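Schematically, the optimization alternates between linearization and a CG solve. The sketch below shows only the outer loop; `solve_gn_subproblem` is a hypothetical stand-in for forming and solving the normal equations of (2.56).

```python
def gauss_newton(f_hat, P, solve_gn_subproblem, num_steps=10):
    """Outer Gauss-Newton loop for the joint filter/projection objective
    (eq. 2.54): each iteration linearizes the bilinear term (eq. 2.55)
    and solves the least squares subproblem (eq. 2.56) with CG.
    solve_gn_subproblem returns the updated filter and the increment dP."""
    for _ in range(num_steps):
        f_hat, dP = solve_gn_subproblem(f_hat, P)
        P = P + dP
    return f_hat, P
```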

    Optimization

Define

$$A_P = \begin{pmatrix}
0_1 & \cdots & 0_D \\
\mathrm{diag}\!\begin{pmatrix} \hat{z}[-K_1^1, -K_2^1]^T p_1 \\ \vdots \\ \hat{z}[-K_1^1, K_2^1]^T p_1 \\ \vdots \\ \hat{z}[K_1^1, K_2^1]^T p_1 \end{pmatrix} & \cdots &
\mathrm{diag}\!\begin{pmatrix} \hat{z}[-K_1^D, -K_2^D]^T p_D \\ \vdots \\ \hat{z}[-K_1^D, K_2^D]^T p_D \\ \vdots \\ \hat{z}[K_1^D, K_2^D]^T p_D \end{pmatrix} \\
0_1 & \cdots & 0_D
\end{pmatrix} \quad (2.57)$$

where $0_d$ is a zero matrix which pads the feature channels of lower resolution. It is of size $\left( 2(K_1 K_2 - K_1^d K_2^d) + (K_1 - K_1^d) + (K_2 - K_2^d) \right) \times (2K_1^d + 1)(2K_2^d + 1)$, where $(K_1^d, K_2^d)$ is the number of Fourier coefficients used for the projected feature dimension d, and $(K_1, K_2)$ is the maximum number of Fourier coefficients used for any feature dimension. Define the vectorizations $p = \begin{pmatrix} p_1^T & \cdots & p_D^T \end{pmatrix}^T$ and $\Delta p = \begin{pmatrix} \Delta p_1^T & \cdots & \Delta p_D^T \end{pmatrix}^T$. Furthermore define

$$\hat{f} = \begin{pmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_D \end{pmatrix}, \quad \hat{f}_d = \begin{pmatrix} \hat{f}_d^i[-K_1^d, -K_2^d] + \Delta\hat{f}_d[-K_1^d, -K_2^d] \\ \vdots \\ \hat{f}_d^i[-K_1^d, K_2^d] + \Delta\hat{f}_d[-K_1^d, K_2^d] \\ \vdots \\ \hat{f}_d^i[K_1^d, K_2^d] + \Delta\hat{f}_d[K_1^d, K_2^d] \end{pmatrix} \quad (2.58)$$

    and


$$B_f = \begin{pmatrix} B_{1,1} & \cdots & B_{1,D'} & \cdots & B_{D,D'} \end{pmatrix}, \quad B_{d,d'} = \begin{pmatrix} \hat{z}_{d'}[-K_1, -K_2]\, \hat{f}_d^i[-K_1, -K_2] \\ \vdots \\ \hat{z}_{d'}[-K_1, K_2]\, \hat{f}_d^i[-K_1, K_2] \\ \vdots \\ \hat{z}_{d'}[K_1, K_2]\, \hat{f}_d^i[K_1, K_2] \end{pmatrix}. \quad (2.59)$$

We can then rewrite the optimization problem as

$$\min_{\hat{f}, \Delta p}\; \| A_P \hat{f} + B_f \Delta p - \hat{y} \|_2^2 + \| W \hat{f} \|_2^2 + \lambda_P \| p + \Delta p \|_2^2 \quad (2.60)$$

with normal equations

$$\begin{pmatrix} A_P^H A_P + W^H W & A_P^H B_f \\ B_f^H A_P & B_f^H B_f + \lambda_P I \end{pmatrix} \begin{pmatrix} \hat{f} \\ \Delta p \end{pmatrix} = \begin{pmatrix} A_P^H \hat{y} \\ B_f^H \hat{y} - \lambda_P p \end{pmatrix}. \quad (2.61)$$

For a full derivation, see Sec. A.3. As stated earlier, the Gauss-Newton method is employed to find an optimal projection matrix P and filter coefficients $\hat{f}$, and in each step the least squares subproblem is handled by letting the method of Conjugate Gradients solve the normal equations.

    Optimization Problem Size

It is worth mentioning the size of the described optimization problem. The vectorization (2.38) of the objective functional (2.29) yields the desired linear least squares problem, but one of a large size. For the baseline, or when using a single subfilter, a typical size of the matrix A is 300000 × 180000. The multiplication $A^H A$ would be impossible to perform. To make the problem feasible, its sparsity is utilized: the matrix A has a very clear structure with the majority of elements being zero. Furthermore, no matrix-matrix multiplications are performed, but rather matrix-vector multiplications. As a last speedup, only half of the Fourier coefficients are stored, as the Fourier transform of a real-valued function is conjugate symmetric.

    Baseline

The C-COT tracker [13], using two of the features introduced with ECO [14], is used as the baseline in this thesis. It utilizes the previously described feature projections, and updates the filter only every sixth frame. In this thesis the baseline is extended and compared with.

3 Method

The tracking framework described in chapter 2 contains an assumption that the target is rigid and that it does not rotate. It is based on a discriminative correlation filter, which is a rigid template matching the target. However, targets are rarely rigid in practice. An example of a common target is a human, who may walk, bend, crouch or rotate, violating the assumptions associated with discriminative correlation filters. These filters are based on correlation and can in part handle both model errors and noise. This implies robustness to such transformations when large enough parts of the target remain similar. The current state-of-the-art trackers rely on sufficiently invariant features to further improve robustness. Especially deep features extracted from classification networks can recognize objects undergoing significant transformations. Several sequences do however show that this is not enough. It is desirable to introduce deformability into the target model.

Staple [4] shows the viability of doing so. They add an additional component to the target model: a histogram of the colors contained in the target patch. This component does not rely on any spatial structure and can therefore handle arbitrary deformations. Another approach, inspired by the part-based models of Felzenszwalb et al. [18], is to use multiple filters, each tracking a part of the target. By allowing the different filters to move relative to each other, deformability is introduced. This approach can improve performance [31][29][33] and is usually coupled with some component merging the information provided by the filters. Either there is some kind of weight on the filters related to the prediction variance, or some kind of component or regularization enforcing a geometric structure.

The aim is to introduce a sufficiently deformable model. The part-based approaches introduce some deformability, but each part is rigid. This could be problematic, as some targets cannot be divided into rigid parts, such as octopuses or gymnasts. Optical flow tends to be estimated in a coarse-to-fine manner. Inspired by this, an approach to deform DCFs could be to repeatedly construct each filter from a set of even smaller filters, until the desired accuracy is reached. That is, the resulting filter consists of several smaller filters, each of which in turn consists of several even smaller filters.



The approach described in this chapter introduces multiple filters into an existing state-of-the-art learning framework. This is a first step towards the final aim of introducing deformations into DCF-based approaches such as the continuous convolution operators. The proposed method utilizes several filters, denoted subfilters, each tracking a part of the target. The tracker is trained by solving a single objective functional. Deformability is introduced into the learning framework described in the background chapter, preserving notation and strategies with some changes and redefinitions. The actual implementation is an extension to the state-of-the-art tracker C-COT, or its descendant tracker ECO.

    3.1 Creating a Deformable Filter

We want to introduce deformability into the tracker. An optical flow-like method transforming a target mask from one frame to the next was proposed in [44]. It allows for arbitrary deformations, albeit it has not been tested on the challenging tracking benchmarks. A related approach is to track feature points of the target into the next frame. Each feature point may however be difficult to track in challenging scenarios, as a feature point relies on a very small spatial support. DCF-based approaches with one template for the target would in many cases be more robust, as their spatial support can include the entire target patch. If large enough parts of the target appearance remain similar between frames, the DCF-based approaches can track successfully. The part-based approaches do not allow these almost arbitrary deformations, but do support the commonly seen piece-wise rigid deformations.

In this section, deformability is introduced into the tracker by rewriting the filter as a linear combination of several small subfilters, each keeping track of a different part of the target. This is inspired by other part-based approaches, but in contrast to these, few additional components are added. One unified objective is minimized. Furthermore, the final aim is not to separate a target into a few equally sized rigid parts, but rather to incorporate a larger set of transformations into the target model. The filter can be seen as a function which is described in a basis of subfilters. Recursively, each subfilter could also consist of a basis of subfilters. The subfilters are viewed as basis functions used by some filter. During detection, the filter position is first updated, also moving the subfilters. The subfilter positions are then refined in the filter training stage, where they are allowed to move relative to each other.

The filter is defined as

$$f(t_1, t_2) = \sum_{m=1}^{M} \tau_{q^m} f^m(t_1, t_2) \quad (3.1)$$

where $f^m \in L_T^2(\mathbb{R}^2)^D$ is a subfilter and $\tau_{q^m}$ is the translation operator corresponding to subfilter m. M is the total number of subfilters. The translation operator translates subfilter m from the filter center to its position $q^m = (q_1^m, q_2^m)$. It is defined as

$$\tau_{q^m} f^m(t_1, t_2) = f^m(t_1 - q_1^m,\; t_2 - q_2^m). \quad (3.2)$$


The filter is a linear combination of third-order tensors, added element-wise, as $f^m$ has two spatial dimensions and one dimension over the feature channels. The detection score operator becomes

$$S_f\{x\} = \sum_{d=1}^{D} \sum_{m=1}^{M} (\tau_{q^m} f_d^m) * z_d \quad (3.3)$$

where we utilize objects described in the theory chapter, with $z_d$ being the element corresponding to dimension d in the vector of continuous functions

$$z = P^T J\{x\}. \quad (3.4)$$

Here $P^T$ is again a projection matrix, and J an interpolation operator applied element-wise to each feature dimension of the input sample x. Note that z is used as a continuous sample, in many ways analogous to x: it is an interpolated and projected sample. This is described in Sec. 2.5.7. Also define a subscore operator $S_{f^m} : \mathcal{X} \to L_T^2(\mathbb{R}^2)$ as

$$S_{f^m}\{x\} = \sum_{d=1}^{D} f_d^m * z_d \quad (3.5)$$

allowing for a simple description of the score operator as a linear combination of subscores,

$$S_f\{x\} = \sum_{m=1}^{M} \tau_{q^m} S_{f^m}\{x\}. \quad (3.6)$$

Here we utilized the translation invariance of convolution and switched the order of summation.
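In the Fourier domain the translations in (3.6) become phase shifts (cf. eq. (3.13) in Sec. 3.3.1, whose sign convention is followed here), so the total score coefficients can be assembled as in the following sketch; array shapes and names are assumptions.

```python
import numpy as np

def total_score_coeffs(sub_scores_hat, positions, T):
    """Combine subscore Fourier coefficients into the total detection
    score (eq. 3.6). sub_scores_hat: list of centered (2K+1, 2K+1)
    coefficient arrays; positions: list of (q1, q2); T: period (T1, T2)."""
    K = sub_scores_hat[0].shape[0] // 2
    k1, k2 = np.meshgrid(np.arange(-K, K + 1), np.arange(-K, K + 1),
                         indexing="ij")
    total = np.zeros_like(sub_scores_hat[0])
    for s_hat, (q1, q2) in zip(sub_scores_hat, positions):
        # Translation by q^m is a per-frequency phase shift (eq. 3.13).
        phase = np.exp(1j * 2 * np.pi * (q1 * k1 / T[0] + q2 * k2 / T[1]))
        total += phase * s_hat
    return total
```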

    3.2 Objective Functional

The new filter formulation yields a slightly altered objective functional to be minimized during filter training. Define

$$\epsilon(f, q) = \epsilon_1(f, q) + \epsilon_2(f) + \epsilon_3(q) \quad (3.7)$$

as a single joint objective functional of three terms, each of which is defined shortly. During filter training the objective is minimized using an alternating optimization strategy.

    Classification Error

The first term corresponds to the error of the detection scores. Ideally the score operator should provide a single sharp Gaussian for each sample, segmenting the target from the background.


Figure 3.1: A heatmap overlaid on the input image. The heatmap displays the mean absolute value of the filter functions over the different dimensions, for one subfilter. Note that the images to the left and in the middle show a smaller spatial support due to a tighter spatial regularization. The filter in the right image has a less tight spatial regularization, which results in a larger spatial support. This illustrates how subfilters can track smaller patches, where the baseline would track the whole object.

Figure 3.2: Three translated detection subscores $\tau_{q^m} S_{f^m}\{x\}$ yielding three small peaks (top and left), resulting in the total detection score $S_f\{x\}$ to the bottom right.

The definition

$$\epsilon_1(f, q) = \sum_{c=1}^{C} \alpha_c \left\| S_f\{x^c\} - y^c \right\|_2^2 \quad (3.8)$$


    provided in the original C-COT framework is kept, but uses the new score operator.

    Spatial Regularization

The second term induces the spatial regularization of the filter coefficients, as introduced by the SRDCF tracker. The regularization is applied to each subfilter as

$$\epsilon_2(f) = \sum_{m=1}^{M} \sum_{d=1}^{D} \| w^m \cdot f_d^m \|_2^2 \quad (3.9)$$

enforcing each subfilter to have its larger coefficients close to the subfilter center. Here $w^m \in L_T^2(\mathbb{R}^2)$ is the continuous spatial regularization function. Note that the regularization can be different for different subfilters. This adds some flexibility: it would for instance allow using one subfilter as a kind of root filter, tracking the whole target, while the other subfilters track small parts. Initial experiments showed that this strategy makes sense, as the added filters add to the tracker complexity and may result in instability in the harshest cases of lighting variation and foreground contamination.

    Regularization of Subfilter Positions

A third term is added to the objective functional described in Sec. 2.5. The subfilters have a smaller spatial support, and hence fewer distinct features, than a filter tracking the whole target. Intuitively this may increase the likelihood of false detections appearing in the detection scores. Initial experiments of this thesis confirm this idea. In practice, deformations are not random. There is structure inherent to the relative movement between different parts of a physical target, and this prior is added to the tracker as an additional regularization term

$$\epsilon_3(q) = \lambda_3 \sum_{c=2}^{C} \sum_{m=1}^{M} \| q^{c,m} - R q^{1,m} \|_2^2. \quad (3.10)$$

Here $q^{c,m}$ is the position of subfilter m in frame c, a two-dimensional vector, $\lambda_3$ is a constant, and R is a linear transform. Examples of this transform are an arbitrary linear transform, a rotation with scaling, or an identity transform. The idea is to have this regularization punish the distance from the, in some sense closest, transform of the chosen type. The transform could be learnt jointly in the optimization scheme. However, optimal solutions can be found in closed form for the listed transform types in terms of mean square error. By using a closed-form solution in each optimization step, the distance to any transform of a certain type is punished, rather than the distance to one specific transform of the given type.
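As an example of such a closed-form solution, the rotation-with-scaling case can be fitted by representing the 2D positions as complex numbers. This is a sketch under the assumption that positions are given relative to the filter center; it is not necessarily the thesis's exact implementation.

```python
import numpy as np

def fit_scaled_rotation(q_ref, q_cur):
    """Least squares fit of a scaled rotation R mapping the first-frame
    subfilter positions q_ref to the current ones q_cur (one transform
    type usable in eq. 3.10). Positions are (M, 2) arrays; as complex
    numbers, R reduces to a single complex scalar."""
    a = q_ref[:, 0] + 1j * q_ref[:, 1]
    b = q_cur[:, 0] + 1j * q_cur[:, 1]
    r = np.vdot(a, b) / np.vdot(a, a)   # optimal complex scale-rotation
    residual = b - r * a                # deviation penalized by eps_3
    return r, np.sum(np.abs(residual) ** 2)
```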

    3.3 Optimization Scheme

The objective is minimized over the filter coefficients and the subfilter positions. To do this, an alternating optimization scheme is used. As optimizing the filter coefficients is a linear least squares problem, this can be done very efficiently with conjugate gradients.


Minimizing the objective over the subfilter positions is, in contrast, a highly non-linear problem. Therefore the positions are found using gradient descent, with the adaptive step length introduced by the Barzilai-Borwein method [2].
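A sketch of such a position update; `grad` is a hypothetical stand-in for the gradient of the objective with respect to the flattened subfilter positions.

```python
import numpy as np

def barzilai_borwein(q0, grad, num_iters=20, step0=1e-3):
    """Gradient descent with Barzilai-Borwein step lengths, illustrating
    how the subfilter positions q could be refined."""
    q = q0.copy()
    g = grad(q)
    step = step0
    for _ in range(num_iters):
        q_new = q - step * g
        g_new = grad(q_new)
        s, y = q_new - q, g_new - g
        denom = np.dot(s, y)
        if abs(denom) < 1e-12:          # avoid division by zero
            break
        step = np.dot(s, s) / denom     # BB1 step length
        q, g = q_new, g_new
    return q
```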

Section 3.3.1 reformulates the objective (3.7) with Parseval's formula. This is similar to standard DCF formulations, where this is done for efficiency reasons as the filter training reduces to solving a system of linear equations. For the continuous formulation this is also necessary to attain a finite representation of the filters. Section 3.3.2 rewrites the objective such that it can be minimized over the filter coefficients via its normal equations using the conjugate gradient method. This method is highly efficient, which is necessary as the number of filter coefficients is very large. However, it relies on the fact that the objective is linear in the filter coefficients, which is not the case for the subfilter positions. These will instead be optimized separately. This process is described in Sec. 3.3.3.

    3.3.1 Fourier Domain Formulation

To optimize over the continuous functions, a Fourier basis is used. The scheme introduced with C-COT is utilized, adapted to the multiple-filter approach. The first term in the objective is rewritten with its Fourier coefficients using Parseval's formula,

$$\epsilon_1(f, q) = \sum_{c=1}^{C} \alpha_c \left\| \hat{S}_f\{x^c\} - \hat{y}^c \right\|_{\ell^2}^2 \quad (3.11)$$

where the linearity of the Fourier transform yields

$$\hat{S}_f\{x^c\} = \sum_{m=1}^{M} \hat{\tau}_{q^{c,m}} \widehat{S_{f^m}}\{x^c\}. \quad (3.12)$$

Here $\hat{\tau}_{q^{c,m}}$ is an operator in the Fourier domain corresponding to the translation operator $\tau_{q^{c,m}}$ in the spatial domain, defined such that

$$\hat{\tau}_{q^{c,m}} \widehat{S_{f^m}}\{x^c\}[k_1, k_2] = e^{i2\pi q_1^{c,m} k_1 / T_1} e^{i2\pi q_2^{c,m} k_2 / T_2}\, \widehat{S_{f^m}}\{x^c\}[k_1, k_2], \quad (3.13)$$

where $q^{c,m} = (q_1^{c,m}, q_2^{c,m})$ is the position of subfilter m in frame c relative to the filter center. The Fourier coefficients of the subscores are found as

$$\widehat{S_{f^m}}\{x^c\} = \sum_{d=1}^{D} \hat{f}_d^m \hat{z}_d^c \quad (3.14)$$

where $\hat{z}_d^c$ is the Fourier coefficients of the interpolated and projected sample for feature dimension d. Parseval's formula changes the second term of the objective into the convolutions

$$\epsilon_2(f) = \sum_{m=1}^{M} \sum_{d=1}^{D} \| \hat{f}_d^m * \hat{w}^m \|_{\ell^2}^2. \quad (3.15)$$

The importance of picking an efficient spatial regularization becomes apparent here. Convolution is an expensive operation performed repeatedly during optimization. This cost is overcome


by picking a spatial regularization which can be represented with few Fourier coefficients. Minimization of these two terms is a quadratic optimization problem and can therefore be solved with very efficient optimization schemes. The last term of the objective, $\epsilon_3(q)$, is however highly non-linear and cannot be subject to these schemes. It is instead minimized in a later stage using gradient descent with Barzilai-Borwein step lengths.

    3.3.2 Filter Training

The objective is minimized over the filter coefficients by solving its normal equations with the method of Conjugate Gradients. As a finite representation is needed, truncate the coefficients, using the first $K^d$ along each frequency dimension, resulting in $(2K^d + 1) \cdot (2K^d + 1)$ coefficients in total. Denote by $(\cdot)^H$ the conjugate transpose of a matrix. Define a block matrix with $C \times MD$ blocks

$$A = \begin{pmatrix} A^1 \\ \vdots \\ A^C \end{pmatrix}, \quad A^c = \begin{pmatrix} A^{c,1} & \cdots & A^{c,M} \end{pmatrix}, \quad A^{c,m} = \begin{pmatrix} A^{c,m,1} & \cdots & A^{c,m,D} \end{pmatrix} \quad (3.16)$$

where $A^{c,m,d}$ is a diagonal matrix of size $(2K + 1) \cdot (2K + 1) \times (2K^d + 1) \cdot (2K^d + 1)$,

$$A^{c,m,d} = \mathrm{diag}\!\begin{pmatrix} \hat{\tau}_{q^{c,m}}[-K^d, -K^d]\, \hat{z}_d^c[-K^d, -K^d] \\ \vdots \\ \hat{\tau}_{q^{c,m}}[-K^d, K^d]\, \hat{z}_d^c[-K^d, K^d] \\ \vdots \\ \hat{\tau}_{q^{c,m}}[K^d, K^d]\, \hat{z}_d^c[K^d, K^d] \end{pmatrix}. \quad (3.17)$$

Further define

$$\hat{f} = \begin{pmatrix} \hat{f}^1 \\ \vdots \\ \hat{f}^M \end{pmatrix}, \quad \hat{f}^m = \begin{pmatrix} \hat{f}^{m,1} \\ \vdots \\ \hat{f}^{m,D} \end{pmatrix}, \quad \hat{f}^{m,d} = \begin{pmatrix} \hat{f}_d^m[-K^d, -K^d] \\ \vdots \\ \hat{f}_d^m[-K^d, K^d] \\ \vdots \\ \hat{f}_d^m[K^d, K^d] \end{pmatrix} \quad (3.18)$$

and

$$\hat{y} = \begin{pmatrix} \hat{y}^1 \\ \vdots \\ \hat{y}^C \end{pmatrix}, \quad \hat{y}^c = \begin{pmatrix} \hat{y}^c[-K, -K] \\ \vdots \\ \hat{y}^c[-K, K] \\ \vdots \\ \hat{y}^c[K, K] \end{pmatrix}. \quad (3.19)$$

Lastly, $\Gamma$ denotes the diagonal matrix of size $CK \times CK$ applying $\alpha_c$ to the sample from frame c.