Heuristic Approach for Robust Visual
Object Tracking
Ahmad Ali
Submitted in partial fulfillment of the requirements for the degree of Ph.D.
August, 2015
Department of Computer and Information Sciences
Pakistan Institute of Engineering and Applied Sciences
P.O. Nilore, Islamabad, Pakistan
Taught man that which he knew not. (Al-Quran)
Thesis Examiners
Student’s Name: Ahmad Ali Department: DCIS
Registration Number: 03-7-1-029-2010 Date of Registration: 11-10-2010
Thesis Title: Heuristic Approach for Visual Object Tracking
Foreign Reviewers (Names and Affiliations)
1. Professor Mihran Tuceryan, School of Science, Indiana University- Purdue University
2. Professor Nie Jian-Wei, Beihang University
3. Professor Chris Chatwin, University of Sussex
Thesis Defense Examiners (Names and Affiliations)
1. Engr. Dr. Shahzad Khalid, Bahria University
2. Dr. Ijaz Mansoor Qurashi, Air University
3. Dr. Sikander Majeed Mirza, PIEAS
Head of the Department (Name) :___________________________Signatures/Date ______________
Thesis Submission Approval
This is to certify that the work contained in this thesis entitled Heuristic Approach
for Visual Object Tracking, was carried out by Ahmad Ali, and in my opinion, it is
fully adequate, in scope and quality, for the degree of Ph.D. Furthermore, it is hereby
approved for submission.
Supervisor: _____________________ Name: Dr. Abdul Jalil Date: 19 August, 2015 Place: PIEAS, Islamabad.
Head, Department of Computer and Information Sciences: _____________________ Name: Dr. Javaid Khurshid
Date: 19 August, 2015 Place: PIEAS, Islamabad.
Dedications
I dedicate my thesis to my late father. He died in May 2015 after fighting his
disease for four years. He motivated me to pursue my Ph.D. at PIEAS in 2010. May
he rest in eternal peace.
Acknowledgements
All praises to ALLAH (S.W.T.), the Creator of everything, for blessing us with
knowledge and endowing us with the status of the noblest of creatures. I am always
grateful to Almighty ALLAH, the Most Benevolent and Merciful, who blessed me
throughout my life despite my limitations, and gave me the ability to undertake such a
challenging task and carry it through to completion.
I extend my sincerest thanks and deepest appreciation to my supervisor,
Dr. Abdul Jalil, for his generous guidance and moral support during my Ph.D. His
valuable suggestions and constructive criticism have led me to achieve my goal
successfully.
A very special note of thanks goes to my parents and my wife, whose
heartfelt prayers, appreciation, and support have always been a valuable asset and a
great source of inspiration for me.
I am also indebted to Dr. Javed Ahmed and Mr. Khalid Akbar for their
cooperation and encouragement in attaining my goal. Thanks are also due to Mr. Imran
Khan and Mr. Naveed Haq, whose encouragement led me to the successful completion
of this thesis.
I gratefully acknowledge Mr. Naeem Ahmed, project director of the IT and
Telecom Endowment Fund, PIEAS. It was his gracious attitude that set me free
from financial worries throughout my Ph.D. He truly deserves special thanks for his
generous support.
Last, but not least, I would like to thank my fellow Ph.D. students (Mr.
Adnan Idris, Mr. Mehdi Hassan, Mr. Muhammad Tahir, Mr. Khurram Jawad,
Mr. Nasir, Ms. Saima Rathore, Mr. Muhammad Aksam Iftikhar, and Mr.
Gibran Javed). These colleagues and friends helped me in times of trouble, praised
my achievements, and cheered me up whenever I was down.
Ahmad Ali
Declaration of Originality
I hereby declare that the work contained in this thesis and the intellectual content of
this thesis are the product of my own work. This thesis has not been previously
published in any form, nor does it contain any verbatim reproduction of published
sources that could be treated as an infringement of international copyright law. I also
declare that I understand the terms ‘copyright’ and ‘plagiarism,’ and that in case of
any copyright violation or plagiarism found in this work, I will be held fully
responsible for the consequences of any such violation.
__________________ Ahmad Ali
19 August, 2015 PIEAS, Islamabad.
Copyrights Statement
The entire contents of this thesis entitled Heuristic Approach for Visual Object
Tracking by Ahmad Ali are the intellectual property of Pakistan Institute of
Engineering & Applied Sciences (PIEAS). No portion of the thesis may be
reproduced without obtaining explicit permission from PIEAS.
Table of Contents
Dedications .................................................................................................................... ii
Acknowledgements ...................................................................................................... iii
Declaration of Originality ............................................................................................. iv
Copyrights Statement ..................................................................................................... v
Table of Contents .......................................................................................................... vi
List of Figures ............................................................................................................... ix
List of Tables ............................................................................................................. xiii
List of Algorithms ....................................................................................................... xiv
Abstract ........................................................................................................................ xv
List of Publications .................................................................................................... xvii
1 Introduction ............................................................................................................ 1
1.1 Issues of Visual Object Tracking ....................................................................... 2
1.2 Motivation and Objective .................................................................................. 4
1.3 Contributions of Thesis ...................................................................................... 5
1.4 Thesis Organization ........................................................................................... 6
1.5 Chapter Summary .............................................................................................. 6
2 Literature Survey ................................................................................................... 7
2.1 Related Surveys ................................................................................................. 8
2.2 Contribution to Existing Surveys ....................................................................... 8
2.3 Classical Tracking Approaches.......................................................................... 9
2.3.1 Mean Shift for VOT .................................................................................. 9
2.3.2 Kalman Filter for VOT ........................................................................... 12
2.3.3 Correlation based Template Matching .................................................... 16
2.3.4 Motion Detection for Tracking ............................................................... 18
2.4 Contemporary Tracking Approaches ............................................................... 20
2.4.1 Tracking by Detection............................................................................. 21
2.4.2 Particle Swarm Optimization .................................................................. 23
2.4.3 Sparse Representation ............................................................................. 25
2.4.4 Integration of Context Information ......................................................... 26
2.5 Evaluation Methods for VOT Algorithms and Benchmark Resources ........... 27
2.6 Chapter Summary ............................................................................................ 30
3 Proposed Template Updating Method ................................................................. 32
3.1 Correlation based Template Updating Methods .............................................. 32
3.1.1 Traditional Template Updating Methods ................................................ 33
3.2 Proposed Template Updating Method ............................................................. 34
3.2.1 Case 1 ...................................................................................................... 36
3.2.2 Case 2 ...................................................................................................... 37
3.2.3 Case 3 ...................................................................................................... 37
3.3 Results and Discussion .................................................................................... 37
3.3.1 Qualitative Analysis ................................................................................ 38
3.3.2 Quantitative Analysis .............................................................................. 40
3.4 Chapter Summary ............................................................................................ 42
4 Proposed Visual Tracking Method ...................................................................... 44
4.1 Related Work ................................................................................................... 44
4.2 Proposed Visual Object Tracking Framework ................................................. 46
4.2.1 Correlation and KF based Tracking ........................................................ 46
4.2.2 Adaptive Threshold ................................................................................. 48
4.3 Occlusion Handling with Kalman Filter .......................................................... 49
4.4 Adaptive Fast Mean Shift Algorithm ............................................................... 50
4.5 Combining Correlation, Kalman Filter and Adaptive Kernel Fast Mean Shift
Algorithms ................................................................................................................ 51
4.6 Results and Discussion .................................................................................... 54
4.6.1 Data Set ................................................................................................... 55
4.6.2 Analysis for Proposed Tracking Algorithm ............................................ 56
4.6.3 Adaptive Threshold with Different Parameter Values ............................ 57
4.6.4 Comparison of Proposed Tracking Method with Its Constituents .......... 58
4.6.5 Performance Comparison of Proposed Tracking Methods with Other
Methods............................................................................................................... 59
4.7 Chapter Summary ............................................................................................ 63
5 Stabilized Active Camera Tracking System ........................................................ 70
5.1 Pan-Tilt Control ............................................................................................... 71
5.2 Video Stabilization........................................................................................... 72
5.3 Proposed Pan-tilt Control Algorithm ............................................................... 73
5.4 Proposed Video Stabilization Algorithm ......................................................... 74
5.5 Results and Discussion .................................................................................... 79
5.5.1 Performance of Stabilization Algorithm ................................................. 79
5.5.2 Performance of Active Camera Tracking System .................................. 81
5.5.3 Performance of Stabilized Active Camera Tracking System ................. 86
5.6 Chapter Summary ............................................................................................ 86
6 Conclusion and Future Work ............................................................................... 89
6.1 Summary .......................................................................................................... 89
6.2 Future Work ..................................................................................................... 90
REFERENCES ................................................................................................................ 92
List of Figures
Figure 1.1 Different applications of visual object tracking ........................................... 1
Figure 1.2 Different issues that arise during tracking .................................................... 3
Figure 2.1 Different classical as well as contemporary approaches for visual object
tracking. ........................................................................................................................ 8
Figure 2.2 (Up) Normal tracking, estimated position by Kalman filter follows the
measured position, (Down) Tracking during occlusion using Kalman filter ............... 14
Figure 2.3 (Source [1]): Adaptive tracking-by-detection process, i.e., tracking the
target and updating the classifier. ................................................................................ 21
Figure 2.4 Positive and negative samples for online AdaBoost [2] ............................. 22
Figure 2.5 Positive and negative bags for MIL classifier [3] ...................................... 22
Figure 2.6 A few tracked frames of Liquor video sequence. The yellow rectangle
shows the tracked window; the more closely it follows the target, the better the result. ......... 28
Figure 3.1 Comparison of different updating schemes (i.e., Naive, α, and β methods
shown in first three rows, respectively) with the proposed method (i.e., fourth row) for
Girl video. The video involves out-of-plane rotation of the target twice (see
Frames 101 and 211). The proposed method updates the template better than any of
these methods, and minimizes the template drift. ........................................................ 38
Figure 3.2 Comparison of different updating schemes (i.e., Naive, α, and β methods
shown in first three rows, respectively) with the proposed method (i.e., fourth row) for
Woman video which contains occlusions, appearance change of the target, clutter and
illumination change in the scene. It is clear that the proposed method works better
than the methods in comparison. ................................................................................. 39
Figure 3.3 Comparison of different updating schemes (i.e., Naive, α, and β methods
shown in first three rows, respectively) with the proposed method (i.e., fourth row) for
Faceocc video. The proposed method successfully handles slowly occurring long-term
occlusion. ..................................................................................................................... 40
Figure 3.4 Center distance error between ground truth value and calculated value by
naive, alpha, beta, and the proposed template updating methods for Girl video. The
template drift is much less by the proposed method. ................................................... 41
Figure 3.5 Center distance error between ground truth value and calculated value by
naive, alpha, beta, and the proposed template updating methods for Woman video.
The template drift is much less by the proposed method............................................. 41
Figure 3.6 Center distance error between ground truth value and calculated value by
naive, alpha, beta, and the proposed template updating methods for Faceocc video.
The template drift is much less by the proposed method............................................. 42
Figure 4.1 Proposed Tracking Algorithm .................................................................... 52
Figure 4.2 Comparison of results for simple correlation tracker, correlation and KF
tracker, and adaptive fast mean shift embedded with correlation and KF tracker for
ThreePastShop2cor video (from Caviar dataset). It proves the claim that adding mean
shift approach with correlation and KF tracker (in the proposed way) improves the
results. .......................................................................................................................... 56
Figure 4.3 Comparison of results for simple correlation tracker, correlation and KF
tracker, and adaptive fast mean shift embedded with correlation and KF tracker for
Liquor video. It proves the claim that adding mean shift approach with correlation and
KF tracker (in the proposed way) improves the results. .............................................. 56
Figure 4.4 Comparison of Pascal score of correlation KF tracker with and without
adaptive fast mean shift algorithm ............................................................................... 57
Figure 4.5 Comparison of mean distance error of correlation KF tracker with and
without adaptive fast mean shift algorithm .................................................................. 58
Figure 4.6 Center distance error for Box video sequence ............................................ 62
Figure 4.7 Pascal Score for Box video sequence ......................................................... 63
Figure 4.8 Distance Score for Board video sequence .................................................. 64
Figure 4.9 Pascal Score for Board video sequence...................................................... 65
Figure 4.10 Distance Score for Liquor video sequence ............................................... 66
Figure 4.11 Pascal Score for Liquor video sequence ................................................... 66
Figure 4.12 Sample tracked frames of Box video sequence. The proposed algorithm
successfully tracks the target during occlusions, scale changes, 3D motion causing
blurriness, and clutter background. .............................................................................. 67
Figure 4.13 A few tracked frames of Liquor video sequence. The proposed approach
successfully tracks during occlusions, 3D motion causing blurriness, and background
clutter. .......................................................................................................................... 67
Figure 4.14 Results for Board video sequence. The proposed algorithm successfully
handles the out of plane motion of the target in cluttered background. ....................... 67
Figure 4.15 Frames of Car video sequence. The proposed algorithm successfully
tracks the target in low light conditions. ...................................................................... 68
Figure 4.16 Some frames from David video sequence. The proposed algorithm tracks
the target in varying illuminations and appearance changes. ...................................... 68
Figure 4.17 A few frames of Faceocc2 video sequence. The proposed algorithm
tracks the target with large appearance changes and slowly occurring heavy
occlusions. .................................................................................................................... 68
Figure 4.18 A few frames of Singer video sequence. The proposed algorithm
successfully handles high illumination effects as well as large scale changes. ........... 69
Figure 4.19 Some tracked frames from the sequence ThreePastShop2Cor2 (Caviar
dataset). The main challenges in the video include the existence of similar objects,
and the occlusions which occur while the persons in the sequence cross each other.
The proposed method successfully tracks the target. ................................................... 69
Figure 5.1 Simplified block diagram of the proposed stabilized active camera tracking
system. ......................................................................................................................... 70
Figure 5.2 Relationship between α and cut-off frequency of the low-pass filter ......... 75
Figure 5.3 Magnitude of frequency response of the low-pass filter at α = 0.11 .......... 76
Figure 5.4 Original (left side) versus stabilized (right side) frames of a video recorded
from a vibratory flying helicopter ................................................................................ 77
Figure 5.5 Original versus stabilized x-coordinates of the left truck shown in Figure
5.4................................................................................................................................. 78
Figure 5.6 Original versus stabilized y-coordinates of the left truck shown in Figure
5.4................................................................................................................................. 79
Figure 5.7 Original (left side) versus stabilized (right side) frames of a video recorded
from a vibratory hovering helicopter ........................................................................... 80
Figure 5.8 Original versus stabilized x-coordinates of the building shown in Figure 5.7
...................................................................................................................................... 81
Figure 5.9 Original versus stabilized y-coordinates of the building shown in Figure 5.7
...................................................................................................................................... 82
Figure 5.10 A helicopter is being tracked persistently and precisely with the proposed
tracking system even when the user has initialized the template inaccurately, and the
size, the appearance, and the velocity of the helicopter is continuously varying. ....... 83
Figure 5.11 Tracking the face of a person during severe illumination variation, noise,
low detail, and occlusion. All the lights in the room were turned off in this experiment
to create a challenging scenario. The dark yellow rectangle in Frame 495 indicates
that the tracker is currently working in its occlusion handling mode. ......... 84
Figure 5.12 Results of un-stabilized (left column) vs. stabilized active camera tracking
(right column) of a distant airplane .............................................................................. 85
Figure 5.13 Results of un-stabilized (left column) vs. stabilized active camera tracking
(right column) of a pedestrian ...................................................................................... 87
List of Tables
Table 2.1 Several related surveys .................................................................................. 7
Table 2.2 Comparison of different VOT algorithms using Mean Shift (S/M - Single
target or multiple target, O - occlusion, IV - high illumination variations, SV - sudden
and large change in target velocity, SC - scale change). Symbols √ and ⅹ,
respectively, show that the algorithm does or does not handle the issue. .......................... 11
Table 2.3 Comparison of different VOT approaches exploiting KF (OS- optimum
search, O-occlusion, LM - large target movement, SV- sudden change in velocity).
Symbol √ shows that the tracking algorithm handles the issue and symbol ⅹ means it
does not tackle the issue. .............................................................................................. 16
Table 2.4 Comparison of different correlation metrics. ............................................... 18
Table 2.5 Representative work of tracking-by-detection technique. ........................... 23
Table 2.6 Representative work of using different variants of PSO in VOT ................ 24
Table 2.7 Representative work of exploiting context information for VOT ............... 27
Table 2.8 List of a few online publicly available tracking resources. ......................... 29
Table 3.1 Description of test videos ............................................................................ 37
Table 3.2 Mean center location error for test video sequences using naive, α, β, and
the proposed template updating methods. .................................................................... 42
Table 4.1 Description of dataset .................................................................................. 53
Table 4.2 Pascal score on test video sequences with different values of ψ ................. 54
Table 4.3 Mean distance error on test video sequences with different values of ψ ..... 55
Table 4.4 Comparison of correlation KF tracker with and without adaptive fast mean
shift algorithm .............................................................................................................. 59
Table 4.5 Mean center location error for video sequences of dataset .......................... 60
Table 4.6 Pascal VOC score for video sequences of dataset ....................................... 61
Table 5.1 Maximum steady state error of the tracker ................................................. 74
List of Algorithms
Algorithm 3.1 Proposed template updating method .................................................... 36
Algorithm 4.1 Correlation and Kalman filter tracking ................................................ 47
Algorithm 4.2 Adaptive threshold ............................................................................... 48
Algorithm 4.3 Occlusion handling with Kalman filter ................................................ 49
Algorithm 4.4 Adaptive fast mean shift algorithm ...................................................... 50
Algorithm 4.5 Combining correlation, Kalman filter and adaptive fast mean shift
algorithms .................................................................................................................... 51
Abstract
Visual Object Tracking (VOT) is an important field of computer vision which has a
number of applications in different fields, including military as well as commercially
available security and surveillance systems. The contributions of this thesis to this
field are manifold.
Firstly, a comprehensive survey of different classical and contemporary
approaches for VOT is presented. It enables swift understanding of old as well as new
trends in this field.
Secondly, a novel method for template (appearance model of the target)
updating is presented. It adaptively updates the template according to the rate of
change of target’s appearance. Comparison with existing template updating
techniques shows the robustness of the proposed template updating method against
the template drift as well as the stagnation to the old appearance problems.
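The rate-dependent updating idea described above can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the reliability threshold of 0.5, the base blending weight, and the function name are assumptions.

```python
import numpy as np

def update_template(template, patch, peak_corr, base_alpha=0.1):
    """Blend the stored template with the newly tracked patch.

    The blending weight grows with the peak correlation value, so the
    template adapts quickly when the match is reliable and is left
    unchanged when the match is poor (e.g., during occlusion), which
    limits both template drift and stagnation on an old appearance.
    """
    if peak_corr < 0.5:                 # unreliable match: keep old template
        return template
    alpha = base_alpha * peak_corr      # adapt the rate to match quality
    return (1.0 - alpha) * template + alpha * patch
```

A higher `base_alpha` tracks appearance change faster at the cost of drifting more readily onto occluders.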
Thirdly, a new approach for VOT is proposed which heuristically combines correlation,
Kalman filter, and adaptive kernel fast mean shift algorithms. The correlation
tracker is generally computation-intensive (if the search space or the template is
large), and it suffers from the template drift problem. Moreover, it fails in the case of a
fast maneuvering target, rapid appearance changes, occlusion, and clutter in the scene.
These issues are handled by using the proposed template updating method and the
Kalman filter (KF) with the correlation tracker. The threshold for template updating is
made adaptive by using the current peak correlation value in the proposed tracking
framework. The KF predicts the target coordinates for the next frame when the
measurement vector is supplied to it by the correlation tracker. Thus, a relatively small
search space can be determined in which the probability of finding the target in the next frame is high.
This way, the tracker becomes fast and rejects the clutter which is outside the search
space in the scene. However, if the tracker provides a wrong measurement vector due to
clutter or occlusion inside the search region, the efficacy of the filter deteriorates
significantly. In this case, the KF-predicted position is far apart from the
correlation-measured position. A similar situation arises if a moving target suddenly
changes its direction. In order to handle such scenarios, a Fast Mean Shift (FMS) vector
is computed inside the difference image of two consecutive search windows to find
a cluster of template size in it, which is considered a target candidate. The FMS
kernel is made adaptive to the varying size of the target. The proposed tracker
considers the KF-predicted position to be the true target position if it is close to the
FMS-generated position. Otherwise, the correlation measurement is followed.
with state-of-the-art tracking algorithms on publicly available standard datasets shows
that the proposed algorithm outperforms the other algorithms in most of the cases.
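The heuristic fusion described above can be sketched as follows, assuming 2-D pixel positions. The distance threshold and the function name are illustrative assumptions, and the correlation, KF, and FMS position estimates are taken as given inputs rather than computed here.

```python
import numpy as np

def fuse_position(corr_pos, kf_pos, fms_pos, dist_thresh=15.0):
    """Heuristically fuse three position estimates of the target.

    If the KF prediction agrees with the fast-mean-shift cluster
    position (their distance is small), the correlation measurement
    is assumed to be corrupted by clutter or occlusion and the KF
    prediction is trusted; otherwise the correlation measurement,
    as the direct observation, is followed.
    """
    corr_pos, kf_pos, fms_pos = map(np.asarray, (corr_pos, kf_pos, fms_pos))
    if np.linalg.norm(kf_pos - fms_pos) < dist_thresh:
        return kf_pos
    return corr_pos
```

In a full tracker this decision would run every frame, with the chosen position fed back to the KF as its next measurement.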
Fourthly, a stabilized active camera tracking system is presented. It comprises
a camera mounted on a Pan-Tilt Unit (PTU), which is placed on a moving platform.
Vibrations of the moving platform produce jitter in the video from the camera,
which may strain the eyes of the viewer. The outcome of the proposed
tracking algorithm is employed to digitally stabilize the video without any significant
computational overhead. Experimental results show the efficacy of the proposed
algorithm.
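A minimal sketch of this digital stabilization idea, assuming a first-order low-pass filter with coefficient alpha applied to the tracked target coordinates (the filter form and the value 0.11 are assumptions based on the alpha used in Chapter 5); the per-frame compensation shift is the difference between the smoothed and raw trajectories:

```python
def stabilize_trajectory(raw_xy, alpha=0.11):
    """Low-pass filter a tracked (x, y) trajectory.

    smoothed[t] = (1 - alpha) * smoothed[t-1] + alpha * raw[t]
    The per-frame shift (smoothed - raw) would then be applied to the
    frame to cancel the high-frequency jitter introduced by platform
    vibration while preserving intentional camera motion.
    """
    smoothed, shifts = [], []
    sx, sy = raw_xy[0]
    for (x, y) in raw_xy:
        sx = (1 - alpha) * sx + alpha * x
        sy = (1 - alpha) * sy + alpha * y
        smoothed.append((sx, sy))
        shifts.append((sx - x, sy - y))   # offset to warp the frame by
    return smoothed, shifts
```

A smaller alpha lowers the cut-off frequency, giving a smoother trajectory but a slower response to genuine target motion.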
Index terms – visual object tracking, template updating, video stabilization, Kalman
filter
List of Publications
• Ahmad Ali, Abdul Jalil, JianWei Niu, Xioke Zhao, Javed Ahmed, Muhammad
Aksam Iftikhar, Saima Rathore, “Visual Object Tracking - Classical and
Contemporary Approaches”, accepted in Frontiers of Computer Science, Springer
Verlag, 2015.
• Ahmad Ali, Abdul Jalil, Javed Ahmed, Muhammad Aksam Iftikhar, Mutawarra
Hussain, “Correlation, Kalman Filter and Adaptive Fast Mean Shift based
Heuristic Approach for Robust Visual Tracking”, Journal of Signal, Image, and
Video Processing, Springer Verlag, pp. 1-19, Jan. 2014, doi: 10.1007/s11760-014-
0612-0.
• Javed Ahmed, Ahmad Ali, Asifullah Khan, “Stabilized Active Camera Tracking
System”, Journal of Real-Time Image Processing, Springer Verlag, pp. 1-20, May
2012, doi: 10.1007/s11554-012-0251-z.
• Irum Anayat, Rooh-ul-Amin, Ahmad Ali, “Moving Object Tracking in Video
Sequences: Moving Object Tracking in Video Sequences through Template
Matching, Fast Mean Shift and Kalman Filter”, Publisher: VDM Verlag Dr.
Muller, 2011, ISBN: 978-3639377552.
• Muhammad Imran Khan, Javed Ahmed, Ahmad Ali, Asif Masood, “Robust
Edge-Enhanced Fragment Based Normalized Correlation Tracking in Cluttered
and Occluded Imagery,” International Journal on Advanced Science and
Technology, vol. 12, pp. 25-34, 2009, doi:10.1.1.359.7828.
• Ahmad Ali, Hameed Kausar, Muhammad Imran Khan, “Automatic Visual
Tracking and Firing System for Anti-Aircraft Machine Gun”, in Proc. 6th
International Bhurban Conference on Applied Sciences & Technology (IBCAST),
2009.
• Ahmad Ali, Sikander Majeed Mirza, “Object Tracking using Correlation, Kalman
Filtering and Fast Mean Shift Algorithms”, in Proc. International Conference on
Emerging Technologies (ICET), 2006.
• Ahmad Ali, Abdul Jalil, Javed Ahmed, Saima Rathore, Muhammad Aksam
Iftikhar, “A New Template Updating Method for Correlation Tracking”, to be
submitted soon.
• Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Ahmad Ali, Mutawarra
Hussain, “An Extended Nonlocal Means Algorithm: Application to Brain MRI”,
accepted in International Journal of Imaging Systems and Technology, Wiley,
2014.
• Ahmad Ali, Ilyas Butt, Asifullah Khan, “Browse-Back Post Event Analyzer”, in
Proc. Of IEEE Conference on Frontiers of Information Technology, 2011.
• Muhammad Aksam Iftikhar, Abdul Jalil, Saima Rathore, Ahmad Ali, Mutawarra
Hussain, “Brain MRI Denoizing and Segmentation based on Improved Adaptive
Nonlocal Means”, International Journal of Imaging Systems and Technology,
Wiley, pp. 234-248, 2013.
• Saima Javed, Mutawarra Hussain, Ahmad Ali, Asifullah Khan, “A Recent Survey
on Colon Cancer Detection Techniques”, IEEE/ACM Transactions on
Computational Biology and Bioinformatics, 2013.
• Saima Rathore, Aksam Iftikhar, Ahmad Ali, Mutawarra Hussain, Abdul Jalil,
"Capture Largest Included Circles: An Approach for Counting Red Blood Cells",
in Emerging Trends and Applications in Information Communication
Technologies, Springer, pp. 373-384, 2012, ISBN: 978-3-642-28961-3.
• Saima Rathore, Madeeha Naiyar, Ahmad Ali, “Comparative study of entity and
group mobility models in MANETs based on underlying reactive, proactive and
hybrid routing schemes", in Proc. 15th IEEE International Multi Topic
Conference (INMIC), December 13-15, 2012.
1 Introduction
Visual Object Tracking (VOT) is a well-known research area in computer vision. Its
main objective is to find the locus of points that the target of interest follows in image
coordinates. This information may be of significant importance for further analysis,
e.g., to calculate the area, perimeter, center of mass, and motion vector of the target.
Thus, target tracking may play an important role in high-level image analysis tasks,
e.g., object recognition [72, 73], activity analysis [5, 74], and intelligent scene
understanding [75]. With the easy accessibility of low-cost, high-performance
computing power and ubiquitously available digital cameras, the usability spectrum of
VOT has widened, and it has found applications in several real-world systems.
A few of these applications, shown in Figure 1.1, include:
Human Machine Interaction (HMI): VOT plays an important role in
improving everyday life by providing easy-to-use interaction with machines, e.g.,
SixthSense (a wearable gesture interface) [76], perceptual user interfaces [77], eye
gaze tracking for disabled people [78], etc.
Figure 1.1 Different applications of visual object tracking: visual surveillance and
security systems, activity recognition, video games, vehicle tracking, traffic
monitoring, human machine interaction, industrial robotics, and medical diagnosis
systems
Visual Surveillance and Security Systems (VSSS): These systems have
become ubiquitous in recent times, and VOT is an important part of intelligent visual
surveillance, e.g., 3rd Generation Surveillance Systems (3GSS) [66], Siemens Sistore
CX EDS [79], surveillance of places and buildings related to public and defense
interests for intruder detection [80], monitoring human activities [81-86], etc.
Traffic Monitoring: VOT provides solutions for monitoring and management
of road traffic, e.g., detection of traffic accidents [87, 88], pedestrian counting
[89], etc.
Industrial Robotics: VOT is applied in the control system of industrial and
humanoid robotics, e.g., using vision sensor with tracking algorithm in feedback loop
[90], ASIMO humanoid robot [91], visual control for Unmanned Aerial Vehicle
(UAV) [92], etc.
Vehicle Tracking: VOT is used for automobile tracking, e.g., tracking a
vehicle by UAV [93], tracking vehicles on the road to assist the driver [94], [18], and
the autopilot of an Unmanned Ground Vehicle (UGV) [95], etc.
Video Games: VOT is used in video games to provide better user control,
e.g., tracking user movements [96], face tracking for playing game [97], etc.
Medical Diagnosis Systems: VOT has shown its importance in the medical
field for diagnosis of different diseases, e.g., tracking of ventricular wall [98],
reconstruction of vocal tract shape [99], [100], etc.
Activity Recognition: VOT is an important component of activity recognition
systems for indoor and outdoor monitoring, e.g., learning activity patterns [101],
human activity recognition [102], etc.
1.1 Issues of Visual Object Tracking
Researchers have invested immense effort in the field of VOT over the last
four decades [56], [103]. Nonetheless, it is still a nontrivial task due to various issues,
as depicted in Figure 1.2. These issues are described as follows:
Occlusion: It is the state when the target is hidden (partially or fully) by
another object. Occlusion detection and handling is an important issue, but there is no
universal technique to tackle it. Therefore, strategies are adopted according to the
nature of the target and the tracking environment.
Appearance change: Most of the targets, especially non-rigid objects, change
their appearance during motion. Therefore, it is mandatory for the target model to
adapt to these changes during a long-term tracking session. Small inaccuracies are
introduced into the target model during updating; they accumulate over time and
ultimately result in unstable tracking as the template slides off the target onto the
background. This issue is called the template drift problem. On the contrary, if the
model is kept fixed (i.e., not updated) or updated slowly, the template cannot
incorporate changes in the target appearance and loses the target; this is the
stagnation problem. Thus, a trade-off between drift and stagnation is required, which
is called the stability vs. plasticity dilemma [53].

Figure 1.2 Different issues that arise during tracking: (a) occlusion, (b) appearance
change, (c) cluttered background, (d) changing size in image, (e) illumination
variations, (f) noise in image, (g) similar objects, (h) complex object motion
Cluttered background: When the background of the target contains many
other objects, it is called a cluttered environment. If the background is known in
advance (e.g., indoor tracking), a cluttered environment is easy to handle, but for an
unknown background or outdoor tracking, the severity of the problem increases.
Changing target size in Image: When the target moves towards or away
from the camera, its size in the image increases or decreases, respectively. Therefore,
the size of the target appearance model must be changed accordingly for robust
tracking.
Illumination variations: Many features of the target which are prominent in
high luminance become obscure in low luminance, and vice versa. This deteriorates
tracking performance. Therefore, illumination changes need to be handled for robust
visual tracking.
Noise in Image: The image of the target scene may be noisy (e.g., due to
electronic circuit noise). Therefore, some preprocessing is required to remove noise
from the image for robust tracking.
Similar Objects: When there are similar objects near the target, it is likely
that the appearance model has a high matching score with the nearby objects, and
discrimination between the target and the rest of the objects becomes difficult.
Complex object motion: When target motion is complex, such as out-of-
plane movement, or abrupt variations in its speed and direction (e.g., motion of a
fighter plane or motion of people during skating), tracking becomes a difficult task
due to the inexact approximation of the underlying motion model.
1.2 Motivation and Objective
Technological advancements in digital video cameras and continuously
increasing computational power have captured the attention of researchers and
developers building visual applications. The usefulness as well as usability of VOT,
as described in Section 1, is growing steadily. Nonetheless, tracking remains a
challenging task in general due to the lack of prior information about the target and
its environment.
The main objective of this research is to propose an algorithm which works
robustly in general when faced with occlusion, clutter, changing target appearance,
illumination variations, etc.
1.3 Contributions of Thesis
The contributions of the thesis are manifold, including the following:
A comprehensive summary of relevant literature, which introduces a new
taxonomy of VOT algorithms into classical and contemporary approaches, and
discussion of different tracking algorithms. This way, the reader may quickly
understand old as well as recent trends in VOT algorithms.
A novel template updating method, which updates the template according to
the rate of appearance change of the target. It tackles both the drift and the stagnation
problems.
A new tracking framework, which heuristically integrates three
elementary tracking algorithms, namely correlation tracker, Kalman filter, and mean
shift algorithms, in a selective and adaptive manner. The proposed tracking
framework includes: (1) adaptive method for updating the template size, appearance,
and search area, (2) heuristic technique for switching back and forth between
correlation measured output and Kalman filter predicted output based on the closeness
of the mean shift tracker output to either measured or predicted target’s position,
respectively, and (3) heuristic techniques for updating some of the thresholds
associated with different decision steps throughout the algorithm.
A stabilized active camera tracking system, which applies the tracking
algorithm to video captured from a camera mounted on a pan-tilt unit, along
with its motion control algorithm for active tracking. The active camera tracking
system produces jitter in the video if it is fixed on a moving platform, e.g., an
Unmanned Aerial Vehicle (UAV), Unmanned Ground Vehicle (UGV), helicopter,
etc.; therefore, video stabilization is required to provide ease of use. A
new stabilization algorithm is presented in the thesis, which uses the tracking
algorithm to filter out the jitter in the video without adding any significant
computational overhead.
1.4 Thesis Organization
The rest of the thesis is organized as follows.
Chapter 2 presents a literature survey of VOT. It classifies tracking algorithms
into classical and contemporary approaches and discusses the different techniques
within each approach. Moreover, tracking resources available online are presented.
Thus, the reader quickly gets an idea of conventional as well as modern trends in this
field.
Chapter 3 proposes a new template updating method, which adapts to
variations in the appearance of the target according to their rate of change.
Experimental results show that the proposed updating method significantly reduces
template drift compared to the other methods.
Chapter 4 describes the proposed tracking algorithm. It combines correlation,
Kalman filter and adaptive fast mean shift algorithm heuristically such that these
elementary algorithms complement each other for robust tracking. Experimental
results show the efficacy of the algorithm.
Chapter 5 presents the stabilized active camera tracking system, which
consists of a Pan-Tilt Unit (PTU) for active tracking. Video becomes shaky if the PTU
is fixed on a moving platform. The proposed video stabilization method uses the
tracking algorithm to digitally stabilize the video without any significant
computational burden.
Chapter 6 concludes the thesis and sums up the techniques presented for
template updating and robust visual tracking. Moreover, it discusses future directions
for VOT.
1.5 Chapter Summary
This chapter discussed applications of VOT in different fields such as human-
computer interaction, industrial robotics, traffic monitoring, video games, vehicle
tracking, medical diagnosis systems, and security and surveillance systems. Although
a lot of work has been done in this field, there exists no universal solution for
VOT due to the absence of any prior information about the target and its background.
Moreover, tracking algorithms face many issues such as occlusion, clutter, similar
objects, noise, variations in lighting conditions, complex object motion, and changing
target appearance and size.
2 Literature Survey
In recent years, VOT has made significant progress due to the availability of low-
cost, high-quality video cameras as well as fast computational resources. Many
modern techniques have been proposed to handle the challenges faced by VOT. This
chapter introduces the readers to (1) various classical as well as contemporary
approaches for object tracking, (2) evaluation methodologies for VOT, and (3) online
resources, i.e., annotated datasets and source code available for various tracking
techniques.
Table 2.1 Several related surveys

| Related Surveys          | Year | Topic                            |
| Chau et al. [5]          | 2013 | Tracking any object              |
| Yilmaz et al. [12]       | 2006 | Tracking any object              |
| Joshi et al. [20]        | 2012 | Tracking any object              |
| Yang et al. [30]         | 2011 | Tracking any object              |
| Cannons [35]             | 2008 | Tracking any object              |
| Geronimo et al. [41]     | 2010 | Pedestrian tracking              |
| Ogale et al. [47]        | 2006 | Pedestrian tracking              |
| Trucco et al. [51]       | 2006 | Surveillance and motion analysis |
| Aggarwal et al. [55]     | 1997 | Surveillance and motion analysis |
| Zhan et al. [58]         | 2008 | Surveillance and motion analysis |
| Kang et al. [60]         | 2007 | Surveillance and motion analysis |
| Arikan et al. [63]       | 2006 | Surveillance and motion analysis |
| Kim et al. [66]          | 2010 | Surveillance and motion analysis |
| Moeslund et al. [68]     | 2006 | Surveillance and motion analysis |
| Arulampalam et al. [69]  | 2002 | Bayesian tracking                |
| Jalal et al. [70]        | 2012 | Wavelet for object tracking      |
| Li et al. [71]           | 2013 | Appearance models                |
2.1 Related Surveys
Several surveillance- and tracking-related surveys can be found in the literature, as
shown in Table 2.1. Most of these surveys are relatively old (i.e., from the last
decade), e.g., [12], [35], [47], [51], [68, 69], [55], [60], [63], [58]; some cover only a
specific field or technique for tracking (e.g., pedestrian tracking [41], Bayesian
methods [69], underwater tracking [51], wavelets for tracking [70]); a few discuss
tracking within a different principal category (e.g., crowd analysis [58], human
motion analysis [68], intelligent visual surveillance [60], appearance models [71]);
and the recent surveys discuss only modern trends in VOT, e.g., [30], or recent
algorithms using variants of classical techniques, e.g., [5], [70].
2.2 Contribution to Existing Surveys
The present survey discusses: (1) classical and contemporary approaches for visual
object tracking, as shown in Figure 2.1, (2) a comparison of different tracking
algorithms, and (3) online resources available for different tracking algorithms, such
as source code, annotated datasets, etc. The survey will help the reader to briskly
understand the old as well as current trends and approaches in visual object
tracking.
Figure 2.1 Classical (mean shift, Kalman filter, correlation based template matching,
motion detection for tracking) and contemporary (tracking by detection, particle
swarm optimization, sparse representation, integration of context information)
approaches for visual object tracking
2.3 Classical Tracking Approaches
In this section, the following widely known classical approaches for visual object
tracking are discussed: (1) mean shift, (2) Kalman filtering, (3) correlation based
template matching, and (4) motion detection based tracking. The main aim of this
section is to highlight different tracking algorithms using the aforementioned
approaches.
2.3.1 Mean Shift for VOT
Mean shift is a non-parametric statistical iterative method, originally developed by
Fukunaga and Hostetler [104]. It is used to find the mode of a distribution given its
discrete sample points; therefore, it is useful in data clustering. Cheng [105]
introduced it to the image processing community. It is a very simple and
straightforward algorithm. It randomly picks image pixels as representatives of cluster
centers. A hypothesized multidimensional ellipsoid is centered on each cluster center
and moved to the mean of the data lying inside the ellipsoid. The same process is
repeated for all the clusters. The mean is iteratively calculated and the cluster centers
are moved accordingly until there is no change in the mean value. Adjacent and
similar regions (similarity depends upon the application type and user-defined
criteria) are merged during the iterations, and the number of final clusters may be
much smaller than the initial number of clusters. Eq. (2.1) describes the calculation of
the mean shift vector as given by [106]:
$$\mathbf{m}(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, g\!\left(\left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2\right)} - \mathbf{x} \qquad (2.1)$$

where g(.) is the kernel profile, x is the center point, xi are the data points, and h is
the bandwidth.
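As a minimal illustration of the iteration behind Eq. (2.1), the following sketch repeatedly shifts a center point by the mean shift vector until it settles on a density mode. The Gaussian kernel profile, bandwidth value, and synthetic data are illustrative assumptions, not taken from any cited implementation:

```python
import numpy as np

def mean_shift_mode(points, x0, h, tol=1e-5, max_iter=100):
    """Seek the nearest density mode from x0 by iterating the mean
    shift vector m(x) of Eq. (2.1) with a Gaussian kernel profile g."""
    x = np.asarray(x0, dtype=float)
    pts = np.asarray(points, dtype=float)
    for _ in range(max_iter):
        # g(||(x - x_i)/h||^2) evaluated for every data point x_i
        d2 = np.sum(((x - pts) / h) ** 2, axis=1)
        g = np.exp(-0.5 * d2)
        # m(x): kernel-weighted mean of the points minus the current center
        shift = (g[:, None] * pts).sum(axis=0) / g.sum() - x
        x = x + shift
        if np.linalg.norm(shift) < tol:
            break
    return x

# Points clustered around (5, 5); the iteration converges near that mode.
rng = np.random.default_rng(0)
data = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2))
mode = mean_shift_mode(data, x0=[4.0, 4.0], h=1.0)
```

In a tracker, the "points" would be pixel locations weighted by a feature likelihood, and the converged mode gives the new target position.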
Mean-shift based schemes suffer from a few drawbacks. They require manual
adjustment of system parameters such as smallest and largest possible window size,
spatial kernel bandwidths, etc. Starting with its application in image segmentation
[105, 107], mean shift gained popularity in the field of VOT following the research
work of Comaniciu et al. [10]. In this paper, mean shift was used for real-time
tracking of non-rigid objects viewed from a moving camera. A probability density
(histogram) was used to model the target, and color was used as the feature for
tracking.
The mean shift algorithm finds the most probable position of the target in each
upcoming frame. Probable target candidates were compared with the original target
model using a metric based on the Bhattacharyya coefficient. This work was
extended as kernel-based object tracking in [17]. The proposed scheme proved to be
computationally fast, and robust against clutter, occlusions, camera orientation, and
scale changes in several scenarios, but it was not successful against illumination
changes and unpredictable object motion. Moreover, spatial information about the
target is lost due to the use of a color histogram as the target representative, and the
Bhattacharyya coefficient is not a strong discriminative measure [108]. Yang et al.
[24] introduced a new similarity measure using an RBF kernel, namely the
expectation of the spatially smoothed density estimates over the model image, which
improved the robustness and frame rate of tracking. Beleznai et al. [25-29] exploited
the mode-seeking capability of mean shift and applied it to the difference image for
detecting and tracking humans in a video. They used a fast version of mean shift for
change detection in the video. A model-based validation scheme was used to confirm
detected changes as humans. Fast mean shift finds clusters in the difference image,
and the updated cluster parameters (e.g., cluster centers) are used for tracking
purposes. The idea of mean shift for VOT was extended by Zivkovic et al. [34], who
used it not only for finding the local mode of the density function, but also for
estimating the local mode shape. The algorithm shows robustness to scale changes
and adaptation of shape, but it was fragile in the case of background clutter, the
presence of multiple targets, and rapidly changing target appearance and motion.
Zhou et al. [39] introduced a new cost function for tracking non-rigid objects and
improved the performance of Zivkovic et al. [34] in complex scenes. Their algorithm
optimally adapts an ellipse for marking the target of interest. The new cost function
contains a Lagrange-based regularization factor which decreases the difference
between the estimated and expected probability distributions. The algorithm shows
better results, but it requires more prior information and is computationally slower
than [34]. Ning et al. [46] used the mean shift approach with a joint color-texture
feature to track the target in complex environments. Shan et al. [52] combined mean
shift with particle filtering for its sampling efficiency. Their work generated good
results for rapid motion with fewer particles than the particle filter alone, but it did
not perform well under occlusion and cluttered background. Wang et al. [57] applied
mean shift to infrared imagery to track humans. They used motion-guided gray and
edge cues to improve the mean shift results. Their algorithm works only for a fixed
camera. The following issues arise when the mean shift approach is used for VOT
with a histogram as the target representative:
• The mean shift approach converges locally due to the local basin of attraction
Table 2.2 Comparison of different VOT algorithms using mean shift (S/M - single
or multiple targets, O - occlusion, IV - high illumination variations, SV - sudden and
large change in target velocity, SC - scale change). Symbols √ and ⅹ, respectively,
show that the algorithm does or does not handle the issue.

| Representative work            | Target representation                 | Similarity measure                                    | S/M | O | IV | SV | SC |
| D. Comaniciu et al. [10], [17] | Color histogram                       | Bhattacharyya coefficient                             | S   | √ | ⅹ | ⅹ | √ |
| C. Yang et al. [24]            | Joint spatial-feature space           | Expectation of density estimates                      | S   | √ | ⅹ | √ | √ |
| C. Beleznai et al. [25-29]     | Difference image                      | No similarity measure                                 | M   | √ | √ | √ | ⅹ |
| Zivkovic et al. [34]           | Color histogram                       | EM-like algorithm                                     | S   | √ | ⅹ | ⅹ | √ |
| Zhou et al. [39]               | Color histogram                       | EM-like algorithm with ellipse outlining the target   | S   | √ | ⅹ | ⅹ | √ |
| Ning et al. [46]               | Joint color-texture histogram         | Bhattacharyya coefficient                             | S   | √ | ⅹ | ⅹ | √ |
| Shan et al. [52]               | Motion color                          | Distance function                                     | S   | √ | √ | √ | ⅹ |
| X. Wang et al. [57]            | Motion and gray edge cues             | Bhattacharyya coefficient                             | S   | √ | √ | √ | ⅹ |
| A. Adam et al. [4]             | Fragment based histogram representation | Earth Mover's Distance                              | S   | √ | ⅹ | ⅹ | √ |
| J. Jiakar et al. [61]          | Fragment based representation         | Bhattacharyya coefficient                             | S   | √ | √ | ⅹ | √ |
| M.I. Khan et al. [65]          | Fragment based edge representation    | Normalized correlation                                | S   | √ | √ | ⅹ | √ |
• Spatial information is lost due to the use of the histogram
• Due to the global nature of the template model, it cannot handle occlusion
(even if it is partial) with good accuracy.
The first two issues were handled using different variants of the mean shift
algorithm, such as in [24, 109], and the third one was tackled using the fragment
based approach. Adam et al. [4] proposed fragment based VOT using mean shift to
handle the last two of the aforementioned issues. They selected fragments randomly
instead of using model-based patches (e.g., head, limb, torso). These spatially non-
overlapping patches help in preserving the spatial information. Multiple histograms
were used to represent each sub-region or patch of the template. The template position
in the upcoming image frame is calculated using a vote map formed from each
patch's individual vote. The integral histogram technique was used to make the
algorithm efficient. Their algorithm shows robustness to partial occlusion, but it lacks
a method for selecting the patches. Jiakar et al. [61] combined the fragment based
approach with mean shift. The user was taken into the loop to select fragments
manually. The patches may be overlapping or non-overlapping. A Bhattacharyya
coefficient based metric was used as the similarity measure. The algorithm showed
impressive results in case of partial occlusion; it also handled illumination,
appearance, and scale changes, and cluttered background. Khan et al. [65] used
normalized correlation for patch based template matching to track the target in the
presence of occluded and cluttered imagery. The template was partitioned into nine
non-overlapping fragments. Table 2.2 summarizes the comparison of different
algorithms using the mean shift approach. It has four groups of columns: the first
column contains the representative work, and the remaining columns describe the
target representation, similarity measure, and the issues handled by each method.
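Several of the trackers in Table 2.2 compare the model histogram against candidate histograms with the Bhattacharyya coefficient. A minimal sketch follows; the gray-level histogram and bin count are chosen purely for illustration (the cited works use color models):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """rho(p, q) = sum_u sqrt(p_u * q_u) for two normalized histograms;
    rho is 1 for identical distributions and 0 for disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def gray_histogram(patch, bins=16):
    """Normalized gray-level histogram of an image patch (an illustrative
    stand-in for the color models used in the cited trackers)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

# A patch compared with itself gives rho = 1 (perfect match).
patch = np.random.default_rng(1).integers(0, 256, size=(24, 24))
rho = bhattacharyya_coefficient(gray_histogram(patch), gray_histogram(patch))
```

A tracker evaluates this score over candidate windows and moves toward the candidate that maximizes it (equivalently, minimizes the distance sqrt(1 − rho)).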
2.3.2 Kalman Filter for VOT
The Kalman Filter (KF) is a statistical parametric recursive algorithm specially
designed for discrete-time systems. It is based on a motion model of a linear dynamic
system and therefore requires its state space representation, as shown in Eq. (2.2) and
Eq. (2.3) [110]:

$$\mathbf{X}_{n+1} = \mathbf{\Phi}\mathbf{X}_n + \mathbf{U}_n \qquad (2.2)$$

$$\mathbf{Y}_n = \mathbf{M}\mathbf{X}_n + \mathbf{V}_n \qquad (2.3)$$

where Xn symbolizes the state vector, Φ represents the state transition matrix, Un
denotes the system noise vector, Vn is the observation noise vector, Yn is the
measurement vector, and M is the observation matrix. KF estimates the states of the
dynamic system in the presence of (1) noisy measurements (Gaussian noise), and (2)
uncertainty in the model of the dynamic system. It works in a prediction-correction
cycle: based on the observed (measured) states, KF corrects its predicted states as
well as updates its gain matrix for better future predictions, as described by Eq. (2.4)
to Eq. (2.9) [111-113].
$$\mathbf{X}^{*}_{n|n} = \mathbf{X}^{*}_{n|n-1} + \mathbf{K}_n\left(\mathbf{Y}_n - \mathbf{M}\mathbf{X}^{*}_{n|n-1}\right) \qquad (2.4)$$

where X*_{n|n} represents the posterior state estimate, X*_{n|n-1} the prior state
estimate, and Kn the Kalman gain matrix defined as:

$$\mathbf{K}_n = \mathbf{S}^{*}_{n|n-1}\mathbf{M}^{T}\left[\mathbf{R}_n + \mathbf{M}\mathbf{S}^{*}_{n|n-1}\mathbf{M}^{T}\right]^{-1} \qquad (2.5)$$

where Rn is the observation noise covariance calculated by Eq. (2.6), and S*_{n|n-1}
represents the predictor error covariance defined by Eq. (2.7):

$$\mathbf{R}_n = COV(\mathbf{V}_n) = E\left[\mathbf{V}_n\mathbf{V}_n^{T}\right] \qquad (2.6)$$

where E[.] is the expected value.

$$\mathbf{S}^{*}_{n|n-1} = COV(\mathbf{X}^{*}_{n|n-1}) = \mathbf{\Phi}\mathbf{S}^{*}_{n-1|n-1}\mathbf{\Phi}^{T} + \mathbf{Q}_n \qquad (2.7)$$

$$\mathbf{S}^{*}_{n-1|n-1} = COV(\mathbf{X}^{*}_{n-1|n-1}) = \left[\mathbf{I} - \mathbf{K}_{n-1}\mathbf{M}\right]\mathbf{S}^{*}_{n-1|n-2} \qquad (2.8)$$

where Qn is the system noise covariance matrix, calculated by Eq. (2.9):

$$\mathbf{Q}_n = COV(\mathbf{U}_n) = E\left[\mathbf{U}_n\mathbf{U}_n^{T}\right] \qquad (2.9)$$
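The predict-correct cycle of Eqs. (2.4)-(2.8) can be sketched for a simple 2D tracking scenario. The constant-velocity state layout [x, y, vx, vy] and the noise covariance values below are illustrative assumptions, not values from any cited tracker:

```python
import numpy as np

dt = 1.0  # frame interval
# Constant-velocity model: state X = [x, y, vx, vy]^T (Eq. 2.2)
Phi = np.array([[1, 0, dt, 0],
                [0, 1, 0, dt],
                [0, 0, 1, 0],
                [0, 0, 0, 1]], dtype=float)
M = np.array([[1, 0, 0, 0],       # only position is measured (Eq. 2.3)
              [0, 1, 0, 0]], dtype=float)
Q = 1e-3 * np.eye(4)              # system noise covariance (Eq. 2.9)
R = 1e-1 * np.eye(2)              # observation noise covariance (Eq. 2.6)

def kf_step(x, S, y):
    """One predict-correct cycle (Eqs. 2.4-2.8). During occlusion the
    correction is skipped and the prediction x_pred alone is used."""
    x_pred = Phi @ x                                        # prior state
    S_pred = Phi @ S @ Phi.T + Q                            # Eq. (2.7)
    K = S_pred @ M.T @ np.linalg.inv(R + M @ S_pred @ M.T)  # Eq. (2.5)
    x_new = x_pred + K @ (y - M @ x_pred)                   # Eq. (2.4)
    S_new = (np.eye(4) - K @ M) @ S_pred                    # Eq. (2.8)
    return x_new, S_new

# Track a target moving one pixel per frame along each axis.
x, S = np.zeros(4), np.eye(4)
for n in range(1, 30):
    x, S = kf_step(x, S, y=np.array([float(n), float(n)]))
```

After a few frames the velocity components converge to one pixel per frame and the position estimate follows the measurements closely.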
Derivation of the KF equations can be found in [111]. In VOT, KF is widely used
in conjunction with other algorithms [9, 17, 21, 32, 36, 44, 114-119]. KF normally
acts in two modes during tracking: (1) normal tracking mode, in which KF
predicts the target coordinates in the image plane for the next frame on the basis of
the target's position in the current frame, which helps to locate the search window
[44]; and (2) occlusion mode, in which KF ignores the measured value and uses its
predicted value for the next state prediction; thus, it is used to handle short-term
occlusion. Figure 2.2 (Up) shows the normal tracking mode, in which the KF-
predicted and actually measured target positions overlap [120].

Figure 2.2 (Up) Normal tracking: the position estimated by the Kalman filter follows
the measured position. (Down) Tracking during occlusion using the Kalman filter

Figure 2.2 (Down) illustrates the occlusion mode, in which KF does not rely on the
measured target position and uses its own prediction to yield the target position in the next
frame. Ahmed et al. [44] combined KF with normalized correlation to handle short-
term occlusion as well as to find the most likely position and size of the search
window for the next frame. Jang et al. [15] used KF for target motion prediction in
order to reduce the search space for matching the target. Comaniciu [9] combined KF
with the mean shift tracker, but it could not handle large movements of the target. Ali
et al. [36] used correlation and fast mean shift algorithms with KF to handle the
complex maneuvering motion of an airborne object. Li et al. [21] used KF with mean
shift and a fast motion estimation algorithm to handle large and sudden movements
of the target. Li et al. [32] used the Bhattacharyya coefficient for adjusting the KF
estimation parameters adaptively. Their results show robustness against partial or
full occlusions, fast target motion, and sudden changes in the target velocity. Ridder
[49] used KF in a discriminative tracking approach. They modeled each pixel with a
KF in order to handle variations in illumination. This way, KF is used for adaptive
background estimation and foreground detection. Peterfreund [121] used KF with
active snakes for robust tracking of the position and velocity of non-rigid as well as
rigid objects. He used the image gradient along the contour, and its optical flow was
used as the system measurement.
KF assumes a linear dynamic model of target motion and Gaussian noise in the
measurements, which is not always true in the real world. Therefore, its variants,
such as the Extended KF (EKF) and the Unscented KF (UKF), have been introduced
[122]. EKF applies a first-order Taylor series to approximate a nonlinear system,
whereas UKF does not apply such an approximation; it uses the unscented
transformation to generate a set of sigma points, which are passed through the
dynamic state and observation models for the final result. UKF results are better than
those of EKF, but it assumes a Gaussian distribution for the posterior; therefore, it
cannot work in the case of multi-modal distributions. The Particle Filter (PF) [69] is
used to cater for these issues. PF is a non-parametric Monte Carlo simulation based
method [123]; it was used for the first time in a tracking application by Isard and
Blake [59] under the name Condensation. PF represents the state of the target by a
set of weighted particles. A weight is assigned to each particle according to its
contribution in finding the target's location. The position of each particle is updated
according to the motion model and the measurement data. PF suffers from the
problem of sample impoverishment, in which samples contribute no useful
information in estimating the target position. Further details can be found in [69,
124]. Table 2.3 briefly describes the different VOT approaches exploiting KF, the
representative work, and the issues handled by each technique.
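The weighted-particle cycle described above (propagate, reweight, resample) can be sketched in a minimal bootstrap form. The random-walk motion model, Gaussian measurement likelihood, and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def particle_filter_step(particles, weights, measurement,
                         motion_std=1.0, meas_std=2.0):
    """One bootstrap particle filter cycle: propagate each particle with
    the motion model, reweight by the measurement likelihood, resample."""
    n = len(particles)
    # 1. Propagate: random-walk motion model (illustrative assumption).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # 2. Reweight: Gaussian likelihood of the measurement given each particle.
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / meas_std**2)
    weights = weights / weights.sum()
    # 3. Resample to fight sample impoverishment.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

# Particles spread over a 100x100 frame concentrate on the target at (40, 60).
particles = rng.uniform(0, 100, size=(500, 2))
weights = np.full(500, 1.0 / 500)
for t in range(20):
    particles, weights = particle_filter_step(
        particles, weights, np.array([40.0, 60.0]))
estimate = particles.mean(axis=0)
```

The posterior may be multi-modal here (unlike KF), since the particle set can cover several hypotheses at once.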
2.3.3 Correlation based Template Matching
Template matching or correlation tracking is the classical method in the field of VOT,
with a history dating back to 1973 [56, 103, 125]. The tracking process is started by
selecting the target in the first frame, either manually or by some automatic target
detection system. The representation of the target is called a template, which is used
to locate the target by correlating it with the video frame in each iteration. The
location with the highest correlation score is considered the new target position.
Different correlation metrics, e.g., standard correlation (SC) [126] (Eq. (2.10)), phase
correlation (PC) [127] (Eq. (2.11)), normalized correlation (NC) [126] (Eq. (2.12)),
and normalized cross correlation (NCC) [128, 129] (Eq. (2.13)), are usually used as
similarity measures in tracking applications. Details of these metrics can be found in
[44].
Table 2.3 Comparison of different VOT approaches exploiting KF (OS - optimum
search, O - occlusion, LM - large target movement, SV - sudden change in velocity).
Symbol √ shows that the tracking algorithm handles the issue and symbol ⅹ means
it does not tackle the issue.

| VOT approaches exploiting KF            | Representative work   | OS | O | LM | SV |
| Mean shift and KF                       | D. Comaniciu [9]      | √ | √ | ⅹ | ⅹ |
|                                         | Jang et al. [15]      | √ | ⅹ | ⅹ | ⅹ |
|                                         | Zhulin Li et al. [21] | ⅹ | ⅹ | √ | √ |
|                                         | Xiaohe Li et al. [32] | ⅹ | √ | √ | √ |
| Correlation and KF                      | A. Ali et al. [36]    | ⅹ | √ | √ | √ |
|                                         | J. Ahmed et al. [44]  | √ | √ | ⅹ | ⅹ |
| Background/foreground detection and KF  | C. Ridder [49]        | ⅹ | ⅹ | √ | √ |
$$c(m,n) = \sum_{i=0}^{K-1}\sum_{j=0}^{L-1} f(m+i,\, n+j)\, t(i,j) \qquad (2.10)$$

$$c = real\!\left( idft\!\left( \frac{\mathbf{F} \cdot \mathbf{T}^{*}}{\left| \mathbf{F} \cdot \mathbf{T}^{*} \right|} \right) \right) \qquad (2.11)$$

$$c(m,n) = \frac{\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} f(m+i,\, n+j)\, t(i,j)}{\sqrt{\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} f(m+i,\, n+j)^2 \,\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} t(i,j)^2}} \qquad (2.12)$$

$$c(m,n) = \frac{\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} \left[f(m+i,\, n+j) - \mu_f\right]\left[t(i,j) - \mu_t\right]}{\sqrt{\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} \left[f(m+i,\, n+j) - \mu_f\right]^2 \,\sum_{i=0}^{K-1}\sum_{j=0}^{L-1} \left[t(i,j) - \mu_t\right]^2}} \qquad (2.13)$$
where f is the image, t is the template, F and T are their Fourier transforms, T* is
the complex conjugate of T, idft(.) is the inverse discrete Fourier transform operator,
real(.) extracts the real part of its operand, and μf and μt denote the means of the
image window and the template, respectively.
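Eq. (2.13) can be evaluated directly at each candidate offset. The following sketch (the image sizes, seed, and brute-force scan are illustrative choices) also shows why NCC is insensitive to a global brightness or contrast change of the image:

```python
import numpy as np

def ncc_score(f, t, m, n):
    """Normalized cross correlation of template t with the window of
    image f at offset (m, n), following Eq. (2.13)."""
    K, L = t.shape
    window = f[m:m + K, n:n + L].astype(float)
    t = t.astype(float)
    dw = window - window.mean()   # subtract window mean (mu_f)
    dt = t - t.mean()             # subtract template mean (mu_t)
    denom = np.sqrt((dw ** 2).sum() * (dt ** 2).sum())
    return float((dw * dt).sum() / denom)

def track_by_ncc(f, t):
    """Exhaustively scan f and return the offset with the highest score."""
    K, L = t.shape
    H, W = f.shape
    scores = [((m, n), ncc_score(f, t, m, n))
              for m in range(H - K + 1) for n in range(W - L + 1)]
    return max(scores, key=lambda s: s[1])[0]

# The template cut from offset (5, 7) is recovered at exactly that offset,
# even after a global brightness/contrast change of the searched image.
rng = np.random.default_rng(3)
image = rng.integers(0, 256, size=(30, 30)).astype(float)
template = image[5:13, 7:15].copy()
best = track_by_ncc(2.0 * image + 10.0, template)
```

In a real tracker the scan is restricted to a search window around the previous target position (or computed in the Fourier domain) rather than over the whole frame.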
SC does not have any bounding value; therefore, no threshold can be set to
validate the match score and update the template. Moreover, it is sensitive to
illumination and produces a peak matching value at the brightest spot in the image.
PC computes correlation in the Fourier domain. It is insensitive to variations in image
intensity because it ignores the Fourier magnitude and uses only the phase component.
It has strong discriminatory power and produces a sharp peak, but it is not as
robust to noise as SC [130]. Moreover, it assigns equal weight to all of its
components, which seems inappropriate, as significant components should ideally be
given more weight than the others [128]. Due to these discrepancies, PC may yield
false positives [44, 131, 132]. Different variants of PC have also been proposed
[133-135], yet they are not as robust to variations in appearance, illumination, and
contrast as NC and NCC. These two metrics take values in the ranges [0, 1] and
[-1, 1], respectively. Therefore, it is easy to set a threshold for template updating and
occlusion handling. Updating the template is mandatory for tracking an object whose
appearance changes. Ali [36] updated the template completely in every frame if the
peak correlation value was higher than a threshold. This updating scheme suffers from
a fast template drift problem if the newly found template position is not exact. To
handle this problem, Ahmed et al. [44] updated the template smoothly using a first-
order IIR filter. Most of the time, NCC is used as a similarity measure in image
registration [126, 128, 136, 137], but NC performs better than NCC when edge
enhancement is performed as a preprocessing step for target tracking [44].
Aasgrizadeh et al. [138] integrated Region Mutual Information (RMI) with edge
correlation tracking for more robust tracking of aerial objects. RMI provides
information about cluttered and clear backgrounds as well as high luminance changes.
Table 2.4 compares the above-mentioned correlation metrics with respect to their
discriminatory power and robustness to noise.
2.3.4 Motion Detection for Tracking

There are various methods for motion detection, including background subtraction, temporal differencing, background modeling, and optical flow.
Background Subtraction
Target or area of interest in a scene is referred to as the foreground and anything else
in the image is termed as background. Background subtraction or foreground
detection may be used for two purposes. First, it may be used to initialize the tracking,
and second, it is used to detect the target of interest from frame to frame. The simplest
method for foreground detection is to subtract each frame from a fixed background
model, in the case of a stationary background. The pixels corresponding to the background yield very low values in the subtracted image, and the pixels related to the foreground produce high values. Thus, a threshold can be set to distinguish the foreground pixels from the background pixels. A connected component algorithm is used to group the foreground pixels, and the target is searched only within the foreground regions. This way, exhaustive search is avoided and the efficiency of the algorithm is improved. This straightforward method of background subtraction normally works in a structured environment; it fails in unstructured or outdoor environments, where the illumination and background do not remain stationary.

Table 2.4 Comparison of different correlation metrics.

Sr. No. Correlation type Discriminatory power Robustness to noise
1 Standard correlation Poor Poor
2 Phase correlation Strong Poor
3 Normalized correlation Strong Strong
4 Normalized cross correlation Strong Strong
A fixed background model does not work outdoors; therefore, an adaptive background model is used. Wren et al. [139] proposed a unimodal Gaussian model for each pixel using its mean and variance in the YUV color space. Their algorithm handles small illumination changes well, but it does not show efficacy in the case of sudden illumination changes or repetitive background motion (e.g., flashing lights, swaying trees or bushes, moving fountains, and the rotating fins of a fan). These issues are handled by the work of Stauffer et al. [101, 140, 141], who adopt a multi-modal Gaussian representation for each pixel and update the model online to learn the changing background. Usually, 3 to 5 Gaussians are used to model each pixel distribution. If a current pixel matches any Gaussian distribution, it is considered background; otherwise, it is classified as a foreground pixel, and the background model is updated accordingly. This multi-modal Gaussian approach does not tackle the problems of drastic illumination change and moving shadows. KaewTraKulPong et al. [142] improved the learning rate of the Gaussian mixture model and introduced a shadow detection method. Their algorithm compares the foreground pixel with the background model; if the difference between the chromatic and brightness components is within a certain threshold, the pixel is considered to be part of a shadow. A similar technique was presented by Horprasert et al. [143, 144]. Haritaoglu et al. [81] developed a real-time surveillance system that trains the background model with three features, i.e., the minimum pixel value (m), the maximum pixel value (n), and the maximum intensity difference between consecutive frames (d). A pixel is classified as foreground if its difference from m or n is greater than d; otherwise, it is taken as a background pixel. Oliver et al. [145] used principal component analysis and Eigen decomposition to build an Eigen background. The projected image of the current frame is subtracted from the Eigen background to detect foreground objects.
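A minimal sketch of a per-pixel adaptive background model, in the spirit of the unimodal Gaussian of Wren et al. [139]; the grayscale simplification, the learning rate, the initial variance, and the 2.5σ threshold are illustrative assumptions rather than values from the cited paper.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel unimodal Gaussian background model (one mean and variance per pixel)."""

    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(float)
        self.var = np.full(first_frame.shape, 15.0 ** 2)  # assumed initial variance
        self.alpha = alpha  # learning rate
        self.k = k          # foreground threshold in standard deviations

    def apply(self, frame):
        """Classify pixels and update the model; returns a boolean foreground mask."""
        frame = frame.astype(float)
        d = frame - self.mean
        foreground = d ** 2 > (self.k ** 2) * self.var
        # update the model only where the pixel is classified as background
        bg = ~foreground
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] = (1 - self.alpha) * self.var[bg] + self.alpha * d[bg] ** 2
        return foreground
```

A Stauffer-style mixture model would keep several (mean, variance, weight) triples per pixel instead of one, but the classify-then-update loop has the same shape.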
Temporal Differencing
Temporal differencing means subtracting the previous frame from the current frame to detect changes or moving objects in the scene. Lipton et al. [146] use temporal differencing between two consecutive frames for foreground detection. Their approach uses multiple hypotheses for classifying foreground regions as targets of interest; the classification metric employs perimeter and area to identify targets in the difference image. In order to improve the detection of foreground regions, a three-frame temporal differencing scheme may also be used [147, 148]. Temporal differencing methods are sensitive to the threshold used to discriminate between foreground and background regions, as well as to illumination changes. Moreover, when the target stops moving, it can no longer be detected as a foreground object.
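The three-frame differencing scheme mentioned above can be sketched as follows; the threshold value is an illustrative assumption.

```python
import numpy as np

def three_frame_diff(prev, curr, nxt, tau=25):
    """Three-frame temporal differencing (cf. [147, 148]): a pixel is labeled
    moving only if it differs from BOTH the previous and the next frame,
    which suppresses the 'ghost' left behind by simple two-frame differencing."""
    d1 = np.abs(curr.astype(int) - prev.astype(int)) > tau
    d2 = np.abs(nxt.astype(int) - curr.astype(int)) > tau
    return d1 & d2
```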
Optical Flow
Optical flow is the apparent motion pattern in an image of a scene due to relative motion between the objects of the scene and the camera. The calculation of optical flow assumes brightness constancy between corresponding pixels in the scene. There are various methods for calculating dense optical flow in an image, such as Lucas and Kanade [56], Horn and Schunck [149], Black and Anandan [150], and Szeliski and Coughlan [151]. Optical flow is used as a feature in segmentation and tracking applications based on the motion of objects. Shi and Tomasi [152] exploited optical flow to find the motion of a region in an image and developed their well-known KLT tracker; the tracker is sensitive to illumination changes and large inter-frame motion. Rangarajan and Shah [153] used optical flow to find the initial inter-frame correspondence between the first two frames for their proposed greedy search algorithm. Papageorgiou et al. [154] used optical flow to reduce the search space of their SVM-based pedestrian and face detection algorithm. Cremers et al. [155] used optical flow as a feature in a contour-based tracking algorithm, Li et al. [156] used it for silhouette tracking, and Bertalmio et al. [157] and Mansouri [158] used it for the minimization of contour energy.
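Under the brightness-constancy assumption, the Lucas-Kanade [56] estimate for a single window reduces to a 2×2 least-squares system over the spatial gradients and the temporal difference. This is a minimal sketch for one well-textured patch; the function and variable names are chosen for illustration.

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Lucas-Kanade flow for a single window: solve A d = b, where A is built
    from spatial gradients (Ix, Iy) of I0 and b from the temporal difference It."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Ix = np.gradient(I0, axis=1)   # horizontal spatial gradient
    Iy = np.gradient(I0, axis=0)   # vertical spatial gradient
    It = I1 - I0                   # temporal difference
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    u, v = np.linalg.solve(A, b)   # assumes the patch is well-textured (A invertible)
    return u, v
```

The KLT tracker applies this solve iteratively per feature window; a near-singular A is exactly the "untextured patch" failure case that motivates good-feature selection.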
2.4 Contemporary Tracking Approaches

In this section, we investigate recent approaches for VOT, which include (1) tracking by detection, (2) particle swarm optimization, (3) sparse representation, and (4) integration of context information.
2.4.1 Tracking by Detection

This class of algorithms treats tracking as the detection of the target in consecutive image frames by training a binary classifier to discriminate the target from its background. Such algorithms are termed tracking-by-detection or tracking-by-repeated-recognition algorithms [3]. These methods have gained popularity in recent years due to their efficacy and the simplicity of the classification task [2, 6, 18, 53, 159-161]. A detailed discussion of different classifiers can be found in [162, 163]. Normally, a classifier requires data for its training, but no prior knowledge is available about the target position in a tracking application. Therefore, training data are generated online during tracking and the classifier is updated accordingly. This is called adaptive tracking-by-detection. Collins et al. [6] presented an approach for online selection of features to discriminate
the target from its background. The estimated position of the target in each frame is
considered as a positive example and its nearby locations are treated as negative
examples for updating the classifier. This step is called Generation and Labeling of
Samples [11]. During tracking, the classifier finds the target position by maximizing the
classification score in a local region, normally around the target position found in the
previous frame, using the sliding-window method. Figure 2.3 explains this tracking and updating process.

Figure 2.3 (Source [1]): Adaptive tracking-by-detection process, i.e., tracking the target and updating the classifier.

Avidan [18] introduced Support Vector Tracking (SVT), which
integrates Support Vector Machine (SVM) classifier with optical flow for vehicle
tracking. Grabner et al. [2] presented the online version of AdaBoost approach for
real-time tracking. The tracking approaches [2, 6, 18] update their classifiers by considering only a single positive example, consisting of the current position of the target, and many negative examples, i.e., samples around the current target position, as shown in Figure 2.4. Small inexactness in the target position results in poorly labeled training samples. This is called label jittering, which degrades the performance of the classifier and ultimately causes the drift problem. Therefore, most recent tracking-by-
detection approaches try to improve tracking performance by making the classifier
more robust to incorrectly labeled examples [3, 40, 50, 164-166]. Babenko et al. [3]
presented the Online Multiple Instance Learning Boosting (Online MILBoost) algorithm for robust tracking. Instead of assigning a label to each individual example, their algorithm combines instances into bags, and a label is assigned to each bag. A positive bag should contain at least one positive example; otherwise, a negative label is assigned to it, as shown in Figure 2.5. Their algorithm shows prominent results against the drift issue, but it fails to recapture the target if it gets out of the scene and returns. This is the problem with all adaptive appearance model based algorithms: they start
updating themselves with a false object if the target is fully occluded or leaves the scene for a while.

Figure 2.4 Positive and negative samples for online AdaBoost [2].
Figure 2.5 Positive and negative bags for the MIL classifier [3].

Grabner et al. [40] solved this issue by using a semi-supervised
appearance model updating method. Their method combines the labeled data (prior
knowledge), i.e., the target selected by the user in the first frame, along with current
unlabeled data. This way, the method becomes robust to the drift issue, but it shows less
adaptability to appearance changes. Zeisl et al. [50] combined the strength of semi-
supervised and multiple instance learning into a single framework. Zhang et al. [45]
improved the work of Babenko et al. [3] by assigning different weights to different
instances. The closer the instance to the target, the higher the weight assigned to it.
William et al. [167] point out that the highest classification score does not necessarily belong to the target position, as there is no explicit relationship between classification confidence and the spatial position of the target. Hare et al. [11] present a framework based on structured output prediction which explicitly incorporates the tracker's need for labeled training examples into the output space. Instead of learning a classifier, the framework focuses on estimating the target transformations by using a structured output SVM. Thus, it avoids the intermediate step of generation and labeling of samples. Table 2.5 summarizes the representative work on tracking-by-detection techniques, mentioning the discriminative technique used in each work.
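The generation-and-labeling step and the sliding-window search described above can be sketched generically; the sampling radii, sample counts, and search range below are illustrative assumptions and do not follow any particular cited tracker.

```python
import numpy as np

def generate_samples(pos, radius_pos=2, radius_neg=(5, 10), n_neg=16, rng=None):
    """Generation and labeling of samples around the estimated target position:
    offsets near `pos` become positive examples, a ring farther away negatives."""
    rng = np.random.default_rng(0) if rng is None else rng
    x, y = pos
    positives = [(x + dx, y + dy) for dx in (-radius_pos, 0, radius_pos)
                                  for dy in (-radius_pos, 0, radius_pos)]
    negatives = []
    while len(negatives) < n_neg:
        dx, dy = rng.integers(-radius_neg[1], radius_neg[1] + 1, size=2)
        if radius_neg[0] <= max(abs(dx), abs(dy)):
            negatives.append((x + dx, y + dy))
    return positives, negatives

def track_step(score_fn, prev_pos, search_radius=8):
    """One adaptive tracking-by-detection step: maximize the classifier score
    over a sliding window centered on the previous target position."""
    x0, y0 = prev_pos
    best, best_pos = -np.inf, prev_pos
    for dx in range(-search_radius, search_radius + 1):
        for dy in range(-search_radius, search_radius + 1):
            s = score_fn((x0 + dx, y0 + dy))
            if s > best:
                best, best_pos = s, (x0 + dx, y0 + dy)
    return best_pos
```

The label-jittering problem discussed above lives in `generate_samples`: if `pos` is slightly wrong, every generated label inherits that error, which is what MIL-style bag labeling tries to absorb.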
Table 2.5 Representative work of tracking-by-detection techniques.

Sr. No. Representative work Discriminatory technique
1. Collins et al. [6] Selection of features by ranking
2. Hare et al. [11] Structured SVM learning
3. Avidan [18] SVM learning
4. Grabner et al. [2] Boosting by selection of features using feature ranking
5. Babenko et al. [3] Boosting by multiple instance learning
6. Grabner et al. [40] Semi-supervised boosting
7. Zhang et al. [45] Boosting by weighted multiple instance learning
8. Zeisl et al. [50] Semi-supervised multiple instance learning

2.4.2 Particle Swarm Optimization

Particle swarm optimization, inspired by birds searching for food, was first introduced by Kennedy et al. [168, 169] in 1995. Since then, its use in different
applications has been increasing day by day, and it has drawn the attention of researchers in different fields [170-172]. PSO is a stochastic process exploiting the phenomenon of swarm intelligence; it works on the collective wisdom contributed by each of its particles. Each particle in PSO updates its position by considering its own best position as well as the best position of its neighborhood, until all particles converge to a common position or the maximum number of iterations is completed. The size of the neighborhood may vary from one particle to the entire swarm around the current particle. An objective function is required to calculate the fitness of each particle. The position at which a particle achieves its highest fitness value is its personal best position, and the best among these personal bests is the global (swarm) best position. PSO has a very simple formulation, consisting of the velocity and position update equations given by Eq. (2.14) and Eq. (2.15).
\[ v_d^i(n+1) = w\,v_d^i(n) + c_1 r_1 \left(p_d^i - x_d^i(n)\right) + c_2 r_2 \left(g_d - x_d^i(n)\right) \qquad (2.14) \]

\[ x_d^i(n+1) = x_d^i(n) + v_d^i(n+1) \qquad (2.15) \]

where v_d^i(n) and x_d^i(n) represent the velocity and position of the ith particle in dimension d at iteration n, respectively, p_d^i is the particle's personal best position, g_d is the global best position of the swarm, w is the inertia weight, c1 and c2 are constants, and r1 and r2 are random values in the range [0, 1]. Different variants of PSO and their applications can be found in [170, 173, 174].

Table 2.6 Representative work using different variants of PSO in VOT

Sr. No. Representative work PSO variant
1. Zhang et al. [7] Sequential PSO
2. Zhang et al. [16] Species PSO
3. Akbari et al. [23] Standard PSO
4. Kwolek et al. [31] Standard PSO
5. Anton et al. [38] Predator-Prey PSO
6. Zheng et al. [43] Standard PSO
7. A. Tawab et al. [48] Standard PSO
8. Borra et al. [54] PSO-FCM

In visual tracking
applications, PSO is used to search for the best candidate position of the target in the
current frame. Zhang et al. [7] introduced the target's temporal sequence information into PSO and named it Sequential PSO. They present a particle-filter-based tracking algorithm with a hierarchical importance sampling process guided by sequential PSO; thus, their approach helps to cope with the classic sample impoverishment problem of the particle filter. Zhang et al. [16] propose species-based PSO for tracking multiple objects. Each species is used to track an individual object, so different trackers run under a single framework, and inter-object occlusion is handled by species competition and repulsion. The number of objects, and hence of species, is initialized by the user at the very beginning of the tracking process. R. Akbari et al. [23] combine PSO and KF to track multiple objects in cluttered environments. They use a non-overlapping, fragment-based representation of objects, where each fragment is represented by a particle; the particles of PSO are guided by KF in a hybrid framework using region as well as object information. Kwolek et al. [31] proposed a multi-object tracking algorithm which uses PSO to improve the target position found by discriminative appearance models. The objective function is based on a fragment-based representation of the targets and their covariance matrices. A. Canalis et al. [38] use PSO for tracking the target in predator-prey style: each particle directly interacts with a pixel, and tracking is performed by the interaction of particles with their environment. Zheng et al. [43] represent the target in a multi-dimensional feature space and employ PSO to expedite the search process; the Bhattacharyya coefficient is used as the fitness function in their algorithm to track faces and vehicles. Abdel Tawab et al. [48] proposed PSO-based fast gray-level object tracking. They employ a combination of SIMilarity (SIM) and Bhattacharyya coefficients as fitness functions to evaluate the scores of the PSO particles. Borra et al. [54] proposed a PSO-Fuzzy C-Means (FCM) based tracking algorithm, in which PSO-FCM is used to segment the objects in the scene and a pattern matching approach is used to track the target. Table 2.6 summarizes the representative work of PSO in VOT.
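A minimal sketch of standard PSO as a search routine, following Eqs. (2.14) and (2.15); the hyperparameter values, swarm size, and box-constraint handling are common illustrative choices rather than settings from the cited works. In a tracker, `fitness` would be the match score of the candidate position.

```python
import numpy as np

def pso_search(fitness, bounds, n_particles=30, n_iters=60,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard PSO: maximize `fitness` over a box given by bounds = (lo, hi)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    x = rng.uniform(lo, hi, size=(n_particles, lo.size))   # particle positions
    v = np.zeros_like(x)                                   # particle velocities
    p = x.copy()                                           # personal bests
    p_fit = np.array([fitness(xi) for xi in x])
    g = p[np.argmax(p_fit)].copy()                         # global (swarm) best
    for _ in range(n_iters):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)  # Eq. (2.14)
        x = np.clip(x + v, lo, hi)                         # Eq. (2.15), box-clipped
        f = np.array([fitness(xi) for xi in x])
        better = f > p_fit
        p[better], p_fit[better] = x[better], f[better]
        g = p[np.argmax(p_fit)].copy()
    return g, p_fit.max()
```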
2.4.3 Sparse Representation

Compressive or sparse representation [175, 176] of a signal expresses the signal as a linear combination of a small number of basis vectors. This representation is becoming popular in various pattern recognition and image processing applications [177-180].
X. Mei et al. [19, 181, 182] used sparse representation for object tracking. Their algorithm copes with the occlusion problem using trivial templates and an l1-minimization approach for sparse representation. Trivial templates are used to model the target as well as the background, which keeps the reconstruction error small for both regions; the candidate region with the minimum reconstruction error is considered the target. The algorithm is computationally expensive due to the l1-minimization step. Liu et al. [183] improved tracking efficiency and robustness by exploiting sparseness and using a set of discriminative features. Their algorithm uses a fixed number of features; therefore, it is not effective in complex or dynamic environments. Liu et al. [184] used mean shift and a histogram-based local sparse representation for the appearance model. However, the histogram representation is unable to distinguish between the target and the background due to its inherent problem of missing spatial information. Zhong et al. [67] made object tracking robust to occlusion using a hybrid approach comprising a sparsity-based discriminative classifier (SDC) and a sparsity-based generative model (SGM). SDC assigns a higher confidence to foreground objects than to background objects, while SGM introduces a new method for calculating histograms which also preserves the spatial position of each patch. Jia et al. [64] proposed a structured sparse representation of the appearance model for target tracking. Their algorithm uses an alignment-pooling method for partial as well as spatial information in order to tackle the occlusion problem; moreover, it introduces a novel template updating method based on incremental subspace learning and sparse representation. Existing tracking algorithms update the appearance model using the current image frame; therefore, these methods are data dependent. Zhang et al. [62] exploit multi-space features for appearance modeling based on a data-independent basis. Their algorithm uses random projections to project the feature space of objects in the image, and sparse representation is employed to extract the features. A detailed review and experimental comparison of sparse-coding-based visual tracking may be found in the paper of Zhang et al. [185].
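As a sketch of the sparse-coding idea, a greedy orthogonal matching pursuit can represent a candidate region as a combination of a few dictionary templates; this is a simple stand-in for the l1 minimization used in the cited trackers, and the names and sizes are assumptions. In a tracker, the candidate with the smallest reconstruction error would be selected as the target.

```python
import numpy as np

def omp(D, y, n_nonzero=3):
    """Orthogonal matching pursuit: greedily pick the dictionary columns
    (templates) most correlated with the residual, then refit by least squares."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        idx = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if idx not in support:
            support.append(idx)
        # refit all selected atoms jointly and update the residual
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x, np.linalg.norm(residual)
```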
2.4.4 Integration of Context Information

Integrating context with the target of interest for robust object tracking has gained significant importance in recent years. Various psychophysics studies have emphasized the role of context in image understanding for the human perception system [186]. A detailed discussion of the role of context in object detection can be found in
[187]. Yang et al. [8] proposed using a number of objects near the target as spatial context to enhance the appearance model. These objects are automatically extracted from the video at run time and are named auxiliary objects. Auxiliary objects are chosen, at least for a short time interval, according to the following criteria: (1) straightforward to track, (2) persistent co-occurrence with the target, and (3) consistent motion correlation with the target. Li et al. [13] modelled the contextual relationship with a dynamic Markov random field for simultaneously recognizing, localizing, and tracking multiple objects of different categories in meeting-room videos. The spatio-temporal relationship is used to obtain information about an object's category and state. Nguyen et al. [22] use spatio-temporal context for multi-target tracking: the spatial context includes nearby objects, and the temporal context contains all previous target models based on Probabilistic Principal Component Analysis (PPCA). Wen et al. [33] also proposed a spatio-temporal context relationship for robust object tracking. Grabner et al. [37] used the Hough transform to integrate temporal context (supporters) and distinguish between strongly and weakly coupled motions. Their algorithm works well in the case of full occlusion and a target changing its appearance
heavily and rapidly. Table 2.7 summarizes the representative work of exploiting context information for VOT.

Table 2.7 Representative work of exploiting context information for VOT

Sr. No. Representative work Contextual information
1. Yang et al. [8] Spatial position
2. Li et al. [13] Spatio-temporal relationship
3. Nguyen et al. [22] Spatio-temporal relationship
4. Wen et al. [33] Spatio-temporal relationship
5. Grabner et al. [37] Temporal context

2.5 Evaluation Methods for VOT Algorithms and Benchmark Resources

VOT algorithms are evaluated qualitatively as well as quantitatively. For qualitative comparison, sample image frames are shown and visually examined. The visually better results are those whose tracked rectangle lies closer to the target of
interest, as shown in Figure 2.6. Qualitative analysis does not provide a fair comparison between different algorithms; therefore, quantitative measures are computed to gain a better understanding of the robustness of the algorithms. For this, two measures are employed. The first is the mean center location error, which gives the distance between the center of the tracking rectangle and its ground truth value. The overall performance of an algorithm is summarized by computing the mean of the center location errors over all frames of a video. The problem with this measure is that if an algorithm successfully tracks the target in most of the video frames but loses it in a few frames by a large distance, its mean error may be worse than that of a method which mostly fails to track the target but clings to background near it. Therefore, this measure alone is not a true representative of performance. One modification was made by Babenko et al. [3] and Henriques et al. [188]: they calculate the percentage of frames in which the distance between the tracked location and the ground truth location is
less than a fixed threshold (e.g., 20 pixels).

Figure 2.6 A few tracked frames of the Liquor video sequence (frames 360, 607, 776, 1115, 1183, 1236, 1319, 1355, 1438, 1462, 1504, 1517). The yellow rectangle shows the tracked window; the closer it is to the target, the better the result.

The other quantitative measure is termed
as Pascal score. It finds out the overlapping area between the tracked target and its
ground truth value as described by Eq. (2.16)
\[ p = \frac{\operatorname{area}\left(R_t \cap R_g\right)}{\operatorname{area}\left(R_t \cup R_g\right)} \qquad (2.16) \]
Table 2.8 List of a few online publicly available tracking resources.
Sr. No.
Name Dataset Ground truth
Source Code
URL:
1. Fragtrack [4] √ √ √ www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm
2. Incremental visual tracker [14]
√ √ √ www.cs.utoronto.ca/~dross/ivt/
3. ℓ1 tracker [19] ⅹ ⅹ √ www.ist.temple.edu/~hbling/code data.htm
4. Kernel based tracker [17]
ⅹ ⅹ √ code.google.com/p/detect/
5. Boosting tracker √ ⅹ √ www.vision.ee.ethz.ch/boostingTrackers/
6. MIL tracker [3] √ √ √ vision.ucsd.edu/~bbabenko/project_ miltrack.shtml
7. Visual tracking decomposition [42]
√ √ √ cv.snu.ac.kr/research/~vtd/
8. Structural SVM tracker [11]
ⅹ ⅹ √ www.samhare.net/research/struck
9. PROST tracker [53] √ √ √ gpu4vision.icg.tugraz.at/index.php?content=subsites/prost/prost.php
10. KLT tracker [56] ⅹ ⅹ √ www.ces.clemson.edu/~stb/klt/
11. Condensation tracker [59]
√ ⅹ √ www.robots.ox.ac.uk/~misard/condensation.html
12. Caviar sequences √ √ ⅹ homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
13. PETS sequences √ √ ⅹ www.hitech-projects.com/euprojects/cantata/datasets cantata/dataset.html
14. Compressive tracking [62]
√ √ √ www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm
15. Structural local sparse tracker [64]
√ √ √ ice.dlut.edu.cn/lu/Project/cvpr12 jia project/cvpr12 jia project.htm
16. Sparsity-based collaborative tracker [67]
√ √ √ ice.dlut.edu.cn/lu/Project/cvpr12 scm/cvpr12 scm.htm
where Rt and Rg are the tracked target region and its ground truth region, respectively, and ∩ and ∪ denote the intersection and union operators. The Pascal score takes values in the closed interval [0, 1]: it is 0 when there is no overlapping region and 1 in the case of full overlap. The target is considered to be successfully tracked in a frame if its Pascal score is greater than 0.5 (i.e., at least fifty percent overlap). In order to have a fair comparison between
different tracking algorithms, two things are required: test videos with annotations, and implementations of the algorithms. Wu et al. [189] have organized a dataset comprising fifty videos with their ground truth values and a code library containing implementations of 29 tracking algorithms. They provide performance evaluation and comparison of these algorithms over different parameters, e.g., scale changes, illumination changes, occlusion handling, and overall tracking performance. In order to make this chapter self-contained, a list of a few publicly available VOT resources is shown in Table 2.8.
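The two quantitative measures described in this section can be sketched as follows, assuming axis-aligned (x, y, w, h) boxes and per-frame center lists; the function names are chosen for illustration.

```python
def pascal_score(rect_t, rect_g):
    """Pascal overlap score of Eq. (2.16) for axis-aligned boxes (x, y, w, h)."""
    xt, yt, wt, ht = rect_t
    xg, yg, wg, hg = rect_g
    iw = max(0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union else 0.0

def precision_at(centers_t, centers_g, thresh=20):
    """Percentage of frames whose center location error is below `thresh` pixels,
    as in the modification of Babenko et al. [3] and Henriques et al. [188]."""
    ok = sum(1 for (xt, yt), (xg, yg) in zip(centers_t, centers_g)
             if ((xt - xg) ** 2 + (yt - yg) ** 2) ** 0.5 < thresh)
    return 100.0 * ok / len(centers_t)
```

Unlike the raw mean center location error, both scores are bounded, so one badly lost frame cannot dominate the summary statistic.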
2.6 Chapter Summary

In this chapter, different object tracking algorithms have been investigated. The proposed taxonomy categorizes VOT algorithms into classical and contemporary approaches. Mean shift, Kalman filter, motion detection, and template matching based algorithms are presented as classical approaches for visual tracking, whereas tracking-by-detection, swarm intelligence, sparse representation, and integration of context have made their place among contemporary tracking algorithms. Representative
work in classical and contemporary approaches has been investigated. It is clear from
the literature discussed in this chapter that no universal tracker exists which may work
equally well in all kinds of situations and environments. Most of the tracking
algorithms work in a structured environment or track a specific type of target. The
reason for this is that there is a lack of accurate mathematical models for complex
target motion and appearance change which may be the future research area for the
community working in computer vision. Distinct feature selection with high
discriminatory power in case of a cluttered background, motion blur and occlusion is
another future avenue to be explored by researchers. Online updating the classifier
and the template representing the target is also required to cope with the varying
appearance of a target. Current updating methods are prone to error due to the
inclusion of the background pixels in the model or classifier and suffer from the
template drift problem. An accurate updating method is an active research field for
experts dealing with tracking problems. Context awareness based tracking approaches
have generally shown better results in recent years. Therefore, integration and
automatic extraction of contextual objects (supporters or auxiliary objects) may get
attention of researchers in future.
3 Proposed Template Updating Method
Visual object tracking can be considered a process consisting of the representation of the target, called the template, and its localization in consecutive image frames. Template updating is required to handle the changing appearance of the target. During updating, the template allows some background pixels to enter its model due to inaccuracy in
calculating the target position. As time passes, these errors accumulate, the template starts sliding off the target, and finally it gets stuck in the background. This problem is called template drift, and it is one of the most challenging problems faced by tracking algorithms. Slow template updating slows down the drift, but it cannot track a target whose appearance changes rapidly; this is called the stagnation-to-old-appearance problem. On the other hand, frequent updating is more prone to drift. Thus, achieving stability of the tracking algorithm and adaptability of its template at the same time requires a trade-off, also named the stability-plasticity dilemma [53]. The existing template updating methods do not take into account the actual appearance changes of the target; therefore, they are not very effective against the template drift and stagnation-to-old-appearance problems. This chapter proposes a new template updating method for correlation based tracking algorithms which updates the template according to the rate of change in the appearance of the target and
finds a good balance in the stability-plasticity dilemma. Moreover, the method is capable of reverting the template to a previously better representation if more recent updates are incorrect. Thus, the proposed algorithm helps to overcome the
problems of template drift, especially during occlusion and complex (e.g., out-of-
plane) motion of the target. Experimental results and comparison with other
algorithms on different publicly available challenging videos prove the efficacy of the
algorithm.
3.1 Correlation based Template Updating Methods

Target detection and tracking algorithms based on correlation have been common in the computer vision community since 1979 [56, 103, 190-192]. In such algorithms, the target is represented by an appearance model, named the template, and it is
matched with the upcoming image or part of the image (called search window) to find
the position of the best possible candidate for the target. The target may be represented by different features such as intensity [36, 193], color [17], texture [194], etc. Details on the selection of features can be found in Yang et al. [30]. Recently, Edge Enhanced
(E2) template representation has proved its efficacy for robust object tracking [44,
138, 191, 195]. In E2 tracking, target may be selected by the user or it may be
detected by some target detection method to initialize the tracking process. The
success of correlation tracking is mainly based on two factors: one is the correlation
metric, which should have as less error as possible in calculating the target position
and the other is a template updating method, which finds tradeoff between stagnation
to old appearance and template drift problems. In this chapter, we will use
Normalized Correlation (NC) for template matching as it is better than other
correlation metrics (e.g., phase correlation, normalized correlation coefficient,
Bhattacharya coefficient) when the template is edge-enhanced [44, 191]. A new
method is proposed which updates the template according to the rate of change in
appearance of the target, i.e., the lower the changes in the appearance of the target, the
slower is the template updating, and the higher the changes, the greater the update
rate. Moreover, if the template is contaminated by background pixels or noise, it will
be restored to its previously found rather good appearance. Thus, the template is
updated properly, which in turn provides support to tackle the complex (e.g. out-of-
plane) motion of the target and slowly occurring long term occlusion.
3.1.1 Traditional Template Updating Methods
In this section, we describe three traditional template updating methods.
Naive Template Updating Method
In this method, the template is updated at every frame, or after a fixed number of frames, provided the peak correlation value is greater than a certain threshold, as follows:
t_{n+1} = \begin{cases} b_n & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \qquad (3.1)
where bn is the best-matched region in the current image, tn and tn+1 represent the current and the updated templates, respectively, cp is the peak correlation value, and τ is a fixed threshold. This scheme assumes that bn is the true target (which is not the case in reality) and completely replaces the current template. Furthermore, it is highly prone to the template drift problem.
α-Template Updating Method
This method does not replace the current template with the best-match region at once; rather, it introduces a parameter α, 0 < α < 1, to smoothly update the template, as follows:
t_{n+1} = \begin{cases} t_n + \alpha (b_n - t_n) & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \qquad (3.2)
If α is assigned a small value (e.g., 0.02) [196], it solves the template drift problem, but it does not cater for rapid changes in the appearance of the target and remains stagnant in the target's old state. To address this issue, the idea of using α = cp was presented [190]. However, during normal tracking the value of cp is greater than 0.9; thus, it behaves the same way as the naive method does.
β-Template Updating Method
This method was proposed by Ahmed et al. [44] and has the same mathematical formulation as the α-template updating method (see Eq. (3.3)). The only difference is that α is replaced by β, where β = 0.15·cp. It smoothly updates the template, but it does not work in the case of a fast maneuvering target.
t_{n+1} = \begin{cases} t_n + \beta (b_n - t_n) & \text{if } c_p > \tau \\ t_n & \text{otherwise} \end{cases} \qquad (3.3)
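For reference, the three conventional rules can be sketched compactly. The following is a minimal NumPy sketch in which `t`, `b`, and `cp` stand for the current template tn, the best-match region bn, and the peak correlation cp; the matching step that produces `b` and `cp` is assumed, not shown:

```python
import numpy as np

def naive_update(t, b, cp, tau=0.70):
    """Eq. (3.1): replace the template outright when the match is good."""
    return b.copy() if cp > tau else t

def alpha_update(t, b, cp, tau=0.70, alpha=0.02):
    """Eq. (3.2): blend a small fixed fraction of the best match into t."""
    return t + alpha * (b - t) if cp > tau else t

def beta_update(t, b, cp, tau=0.70):
    """Eq. (3.3): like the alpha rule, but with beta tied to the peak value."""
    beta = 0.15 * cp
    return t + beta * (b - t) if cp > tau else t
```

With α = 0.02 the template barely moves between frames, which illustrates the stagnation problem; with β = 0.15·cp the step size grows with match quality, but is still too slow for a fast maneuvering target.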
3.2 Proposed Template Updating Method
A good template updating scheme should handle the problems of template drift as well as stagnation in the old appearance. For this, the updating scheme should be such that (1) it incorporates the maximum target changes, i.e., the update rate should be dynamic, depending on whether the target is changing its appearance rapidly or slowly; (2) the template should contain as little background as possible; and (3) if the template is poorly updated with background or noisy pixels, the scheme should be able to restore the template to a better representation. In the proposed method, the first template (which is selected by the user) is considered
the most trusted one and is kept in a buffer throughout the tracking session. Let it be denoted by t1. The last updated template is assumed to contain the maximum change in the target's appearance and is represented by tn, where the subscript shows the frame number in which the template will be used to find the target. It is possible that tn has been corrupted by occlusion or clutter; therefore, the second-last template, denoted tn-1, is also kept in memory. Both templates, tn and tn-1, are correlated with the search window, and their peak correlation values are represented as cp(n) and cp(n-1), respectively. If cp(n) ≥ cp(n-1), the last updated template is considered correct; otherwise, tn is replaced by tn-1. The next step is template updating. For this, t1 is correlated with the search window; its peak correlation value is represented by cp(1). If t1 fails to achieve at least a 50% match in the search window, it is assumed that the target is facing a slowly occurring occlusion, which has corrupted both tn and tn-1. This assumption holds because the targets in the datasets used do not change their appearance very much. Therefore, we start updating tn partly with t1. Equations (3.4) and (3.5) describe the process,
a_n = \begin{cases} t_{n-1} & \text{if } c_{p(n)} < c_{p(n-1)} \\ t_n & \text{otherwise} \end{cases} \qquad (3.4)

d_n = \begin{cases} \omega\, a_n + (1 - \omega)\, t_1 & \text{if } c_{p(1)} < 0.5 \\ a_n & \text{otherwise} \end{cases} \qquad (3.5)
where 0 < ω ≤ 1. The latter process of template updating is mathematically
represented by Equations (3.6) - (3.10).
t_{n+1} = \begin{cases} d_n + \gamma (b_n - d_n) & \text{if } c_p \ge \tau \\ d_n & \text{if } (c_p < \tau) \text{ and } (f \le \lambda) \\ \sigma\, d_n + (1 - \sigma)\, t_1 & \text{if } (c_p < \tau) \text{ and } (f > \lambda) \end{cases} \qquad (3.6)

f = \begin{cases} 0 & \text{if } c_p \ge \tau \\ f + 1 & \text{otherwise} \end{cases} \qquad (3.7)

\gamma = \delta\, \Delta c_{ref} + (1 - \delta)\, \Delta c \qquad (3.8)

\Delta c_{ref} = 1 - c_p \qquad (3.9)

\Delta c = \left| c_p^{\,n} - c_p^{\,n-1} \right| \qquad (3.10)
where the superscript n shows the frame number, 0 ≤ γ ≤ 1, 0 ≤ σ ≤ 1, 0 ≤ δ ≤ 1, λ > 0, and f is a counter whose value is increased by 1 if the peak correlation value, cp, is less than the threshold τ; otherwise, it is reset to zero. These equations are explained as follows.
3.2.1 Case 1 (When cp ≥ τ)
In this case, the template is updated as a weighted average of the current template and the best match found in the image. The weight, γ, is calculated dynamically, as described by Eq. (3.8). It is made a function of the difference of peak correlations, ∆c, in the two
Algorithm 3.1 Proposed template updating method
Input: current template tn, previous template tn-1, initial template t1, search window, previous peak correlation value cp
Output: updated template
1. Initialize σ, δ, λ, ω, and τ
2. Correlate tn, tn-1, and t1 with the search window and calculate cp(n), cp(n-1), and cp(1), respectively
3. if cp(n) < cp(n-1)
4.     tn ← tn-1
5. end if
6. if cp(1) < 0.50
7.     tn ← ω tn + (1 - ω) t1
8. end if
9. oldcp ← cp
10. Correlate tn with the search window and calculate the current peak correlation value cp and the best-match target candidate bn
11. if cp ≥ τ
12.     f ← 0
13.     ∆cref ← 1 - cp
14.     ∆c ← |cp - oldcp|
15.     γ ← δ ∆cref + (1 - δ) ∆c
16.     tn ← tn + γ (bn - tn)
17. else
18.     f ← f + 1
19.     if f > λ
20.         tn ← σ tn + (1 - σ) t1
21.     end if
22. end if
latest frames, and of the difference of the peak correlation from its upper limit (which is 1), ∆cref. For an object changing its appearance heavily and rapidly, ∆c will be higher; otherwise, it will have a smaller value. Thus, the template is updated accordingly. Ideally, the updated template should have a 100%, or at least near-100%, match in the next image. This is achieved by the ∆cref term in Eq. (3.8). Furthermore, ∆cref accelerates the updating process, which is normally slow due to the very small value of ∆c in consecutive image frames.
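The per-frame update of Case 1, together with the fallback cases of Eqs. (3.6)-(3.10), can be sketched as follows. This is a minimal NumPy sketch of steps 9-22 of Algorithm 3.1; the correlation step that produces `b_n`, `cp`, and the stored `cp_old` is assumed:

```python
import numpy as np

def proposed_case_update(t_n, t_1, b_n, cp, cp_old, f,
                         tau=0.70, delta=0.3, lam=3, sigma=0.035):
    """One per-frame update of the proposed method (cases of Eq. (3.6)).

    t_n: current template, t_1: initial (most trusted) template,
    b_n: best-match region, cp / cp_old: current and previous peak
    correlations, f: low-correlation frame counter.
    Returns the updated template and counter."""
    if cp >= tau:                       # Case 1: adaptive blending
        f = 0                           # Eq. (3.7)
        dc_ref = 1.0 - cp               # Eq. (3.9)
        dc = abs(cp - cp_old)           # Eq. (3.10)
        gamma = delta * dc_ref + (1 - delta) * dc   # Eq. (3.8)
        t_n = t_n + gamma * (b_n - t_n)
    else:
        f += 1                          # Case 2: hold the template
        if f > lam:                     # Case 3: fall back toward t_1
            t_n = sigma * t_n + (1 - sigma) * t_1
    return t_n, f
```

For cp = 0.9 and cp_old = 0.85, the weight is γ = 0.3·0.1 + 0.7·0.05 = 0.065, i.e., a faster appearance change yields a faster update.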
3.2.2 Case 2 (When (cp < τ) and (f ≤ λ))
In this case, we do not update the template, considering that the target has been occluded. If this case holds for a certain number of frames, λ, it may be due to one of the following reasons: (1) the template is poorly updated and therefore consistently fails to find a good match in the frames, or (2) the target moved so fast that it has gone outside the search window.
3.2.3 Case 3 (When (cp < τ) and (f > λ))
In this case, the template is smoothly updated with the most trusted one, i.e., t1, to solve the first issue mentioned in Section 3.2.2. To handle the second issue, the search area of the template is iteratively increased. The pseudo code of the proposed method is provided in Algorithm 3.1.
3.3 Results and Discussion
The qualitative as well as quantitative results of the proposed template updating method are shown on different challenging test videos: Girl, Woman, and Faceocc. The videos are publicly available and can be downloaded from [197, 198]. The Girl video has 502 frames and mainly contains the challenges of high appearance
Table 3.1 Description of test videos

Sequence   # of frames   Challenges involved
Woman      552           Occlusions, appearance changes, pedestrian motion
Faceocc    887           Slowly occurring long-term occlusions, high appearance changes
Girl       502           360° out-of-plane rotation, appearance change, occlusion
changes and out-of-plane rotations; the Woman video contains 552 frames and provides the challenges of high appearance changes as well as heavy, long-term occlusions; the Faceocc video comprises 887 frames and challenges tracking algorithms with slowly occurring heavy occlusions. Table 3.1 summarizes the description of these videos. The edge-enhanced normalized correlation method [44], [191] has been employed for object tracking.
3.3.1 Qualitative Analysis
Figure 3.1, Figure 3.2, and Figure 3.3 show a few frames of the Girl, Woman, and Faceocc videos, respectively. The first three rows in each figure are the results of the naive, α, and β methods, respectively, and the fourth row shows the result of the proposed method. The current template is shown at the top-right corner of each frame. The yellow rectangle in the figures shows the position of the best match of the template in the image; a white rectangle indicates that the tracker has lost the target and is in prediction mode. The empirically determined parameter settings are as follows: α
Figure 3.1 Comparison of different updating schemes (i.e., naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Girl video (Frames 27, 101, 133, 211, and 267). The video involves two out-of-plane rotations of the target (see Frames 101 and 211). The proposed method updates the template better than any of these methods and minimizes the template drift.
= cp (as suggested by Wong [190]), β = 0.15·cp (as proposed by Ahmed et al. [44]), τ = 0.70, δ = 0.3, λ = 3, ω = 0.25, and σ = 0.035. These parameter values are kept the same for all the videos.
In Figure 3.1, the target undergoes out-of-plane rotation twice (as shown by Frames 101 and 211). The naive and α methods show almost similar behavior (as expected) and start drifting the template at Frame 101. The β-method does not let the template drift at Frame 101 due to its relatively slow adaptation rate, but it fails at Frame 211. In comparison, the proposed algorithm updates the template according to its rate of appearance change and keeps locking onto the target accurately without any drift.
In Figure 3.2, a large part of the target (a woman walking on the footpath) gets occluded when she passes behind the cars parked at the roadside (see Frames 119, 213, 291, and 380). Moreover, there are appearance changes, clutter, and
Figure 3.2 Comparison of different updating schemes (i.e., naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Woman video (Frames 1, 119, 213, 291, and 380), which contains occlusions, appearance change of the target, clutter, and illumination change in the scene. It is clear that the proposed method works better than the compared methods.
illumination variations in the scene (e.g., Frames 119, 213). Both the naive and α updating methods show almost similar results and completely lose the target due to template drift before Frame 119. The β-method performs better, but it also lets the template drift at Frame 291. The proposed method updates the template better than the other methods and does not allow it to slide off the target throughout the video.
Figure 3.3 shows the results of the proposed algorithm on a few frames of the Faceocc video. The video involves the face of a woman being slowly occluded by a book (e.g., Frames 189, 302, 412, and 466). The naive, α, and β methods start drifting off at Frame 466. The proposed method updates the template in a way that minimizes its drift from the target area.
3.3.2 Quantitative Analysis
For quantitative analysis, the difference between the ground-truth target center and the target center found by each algorithm, i.e., the naive, α, β, and proposed methods, is calculated. It is named the center location error. Figure 3.4, Figure 3.5, and Figure 3.6
Figure 3.3 Comparison of different updating schemes (i.e., naive, α, and β methods, shown in the first three rows, respectively) with the proposed method (fourth row) for the Faceocc video (Frames 1, 189, 302, 412, and 466). The proposed method successfully handles slowly occurring long-term occlusion.
show the center location error for the Girl, Woman, and Faceocc video sequences graphically. The horizontal axis represents the frame number and the vertical axis
Figure 3.4 Center distance error between the ground-truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Girl video. The template drift is much less with the proposed method.
Figure 3.5 Center distance error between the ground-truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Woman video. The template drift is much less with the proposed method.
shows the center distance between the calculated value and the ground truth of the targets. It is clear from these figures that the proposed method has the lowest error (shown in black) compared with the other methods. Table 3.2 summarizes the graphical results by showing the mean center distance for each video sequence, which also shows that the proposed method, on average, has significantly smaller center errors than the other methods.
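The center location error used here is simply the per-frame Euclidean distance between the two centers; a minimal sketch (the coordinate values in the test are illustrative, not from the actual sequences):

```python
import numpy as np

def center_location_error(found, ground_truth):
    """Per-frame Euclidean distance between the target center found by a
    tracker and the ground-truth center; its mean over all frames gives
    one entry of Table 3.2."""
    found = np.asarray(found, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(found - ground_truth, axis=1)
```

For example, a found center of (13, 14) against a ground truth of (10, 10) gives an error of 5.0 pixels for that frame.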
3.4 Chapter Summary
This chapter presented a new template updating method for correlation-based tracking algorithms. The proposed method updates the template according to the rate of
Figure 3.6 Center distance error between the ground-truth value and the value calculated by the naive, α, β, and proposed template updating methods for the Faceocc video. The template drift is much less with the proposed method.
Table 3.2 Mean center location error for the test video sequences using the naive, α, β, and proposed template updating methods.

Sequence   Naive method   α-method   β-method   Proposed method
Girl       54.216         53.711     47.157     21.427
Woman      109.590        129.219    60.589     2.353
Faceocc    27.146         20.523     48.944     11.066
appearance changes of the target. This way, the template incorporates the maximum changes of the target and the minimum background of the scene. Thus, it avoids the template drift problem and performs better in cases of occlusion and complex motion of the target. An edge-enhanced normalized correlation based tracking scheme has been employed; however, the proposed method may also be used with other similarity measures and tracking algorithms. Experimental results on different challenging, publicly available videos show the efficacy of the proposed algorithm in comparison with three other template updating methods.
4 Proposed Visual Tracking Method
The correlation tracker is computationally intensive; its efficiency depends on the size of the search space and the template. Moreover, it suffers from the template drift problem and may fall short in cases of a fast maneuvering target, rapid variations in its appearance, occlusion, and clutter in the background. To address these problems, a Kalman filter (KF) can be employed. The KF predicts the target coordinates in the next frame based on the measurement vector yielded by the correlation tracker. This way, a relatively small search space can be defined around the position where the target is most likely to be found in the next frame. Thus, the tracker becomes more efficient and discards the clutter that lies outside the search space. However, if the tracker produces a wrong measurement vector due to clutter or occlusion inside the search space, the performance of the filter deteriorates considerably. This chapter proposes a solution to this problem by incorporating the mean shift method into the tracking framework. The mean shift tracker is fast and has shown good results in the literature, but it fails when the histograms of the target and a candidate region in the scene are similar (even when their appearances are different). To make the overall visual tracking framework robust to the aforementioned problems, the three approaches, i.e., correlation, KF, and mean shift, are combined heuristically in such a way that they compensate for each other's weaknesses, yielding robust tracking results. The template updating method presented in Chapter 3 is used in the proposed tracking framework. Furthermore, the framework uses novel methods for (1) an adaptive threshold for the similarity measure, which sets a variable threshold for each upcoming image frame based on the peak similarity value of the current frame with the template, and (2) an adaptive kernel size for the fast mean shift algorithm based on the varying size of the target.
4.1 Related Work
A brief summary of different tracking techniques is given in Chapter 2. This section discusses, in somewhat more detail, the tracking algorithms relevant to this chapter, i.e., those related to correlation, the Kalman filter, and mean shift.
Different correlation-based similarity metrics, e.g., phase correlation, normalized correlation, and normalized correlation coefficient, are used for visual tracking. Phase correlation has been used in [199], [127], [200] for image registration and tracking, but it is not robust to noise [130] and sometimes produces higher peaks at wrong positions [131], [132], [44]. This problem was overcome in [190] by using an edge-enhanced image instead of a grayscale image. Ahmed et al. [135] used an extended flat-top Gaussian weighting function with grayscale images to handle it. Some other papers, such as [201], [133], [134], also propose algorithms to enhance the performance of phase correlation. In cases of variations in appearance, shape, brightness, etc., none of these methods produces tracking results as good as normalized correlation does [202], [190], [44]. The Normalized Correlation Coefficient (NCC) is another widely used similarity measure for object localization [136], [126], [128], [203], [137]. NCC imposes the constraint of non-uniformity on the template and search window. The issue of occlusion handling using NCC was tackled in [193] with the help of a Kalman filter: the value of NCC is checked against an empirically determined threshold, and if NCC is less than the threshold, occlusion is assumed and the next position of the target is calculated by the Kalman filter. A similar technique for occlusion handling was used in [44], [191] with normalized correlation, which is computationally more efficient than NCC in the spatial domain and does not require the template and search window to be non-uniform. It was shown in [202] that normalized correlation produces better results than NCC when an edge-enhanced image is used for matching instead of a grayscale image. Ali et al. [36] combined NCC with a Kalman filter and fast mean shift to handle complex object motion, but their approach was not robust against clutter and occlusion.
Beleznai et al. used the fast mean shift algorithm for the detection of humans in groups [29], [28], and further extended their work to track humans [25-27]. Wang et al. [57] used a multi-cue fusion based mean shift algorithm to track a human in infrared imagery. Sutor et al. [204] presented efficient mean shift clustering to detect and track humans. Shan et al. [52] proposed a mean shift embedded particle filter for hand tracking. Yilmaz et al. [205] used mean shift with motion compensation to track targets in Forward Looking Infra-Red (FLIR) imagery. Comaniciu et al. [206] employed color histograms for real-time visual tracking of non-rigid objects using mean shift. They used the Bhattacharya coefficient as a similarity metric to find the candidate, obtained by the mean shift algorithm, that is most similar to the target. Afterwards, Comaniciu and Ramesh [9] combined mean shift and KF for object tracking based on color histograms. They used mean shift iterations to get the best candidate target, and the KF for the next target position in the upcoming image frame. As the next frame arrives, mean shift is initialized at the target position predicted from the previous frame. Li et al. [32] suggested an adaptive KF with mean shift for object tracking; it adaptively updates the parameters of the KF, as opposed to previous techniques that keep the KF parameters constant. Similar to [206] and [9], color histogram based target representation is considered in [32]. Since the color histogram does not carry spatial information of pixels [4], it is likely to detect a wrong target with a histogram similar to that of the true target [44]. Therefore, the idea of heuristically combining correlation, the Kalman filter, and the adaptive-kernel fast mean shift algorithm for better visual tracking results is proposed in this chapter.
4.2 Proposed Visual Object Tracking Framework
The proposed VOT algorithm combines the strengths of three basic trackers, i.e., correlation, KF, and mean shift. Besides this, the other contributions of the proposed tracking method include: (1) the novel template updating approach described in Chapter 3, (2) adaptive thresholding, and (3) the adaptive-kernel fast mean shift algorithm. The details of the proposed tracking framework and each of its components are given as follows.
4.2.1 Correlation and KF based Tracking
To initiate the correlation-based tracking process, the target is initially selected by a user. The sub-image representing the target is called its template. In the proposed method, the Edge-Enhanced (E2) representation of the template has been employed as the target appearance; the search window of the template is also made E2. Edge enhancement is a four-step process consisting of Gaussian smoothing, calculation of the gradient magnitude, normalization of intensity, and thresholding. Interested readers may consult [44] for further details of these steps. The size of the search window is not kept constant; rather, it is dynamically adjusted with the help of the KF throughout the tracking session. Thus, the tracker becomes computationally efficient, gets rid of clutter outside the search space, and yields better tracking results. Details about the dynamic search window can be found in [191]. Variations in the size of the target in the image
plane are handled by the following two processes. The first is correlating the original template and its smaller and larger versions, i.e., scaled by 0.90 and 1.10, with the search space; the size of the template with the highest correlation value is used for matching in the next image frame. The same technique has been proposed for scale handling in many other papers, such as [17], [207], [191], [95]. The limitation of this technique is that it works in discrete steps, i.e., 10% scale change, on the full template; therefore, it does not work well when the template must change its size in a particular direction. Therefore, the second technique, the Best Match Rectangle Adjustment (BMRA) algorithm [208], is used to resize the template according to the target size and to keep the target at the center of the template. BMRA divides the template into nine non-overlapping patches and calculates the energy of each patch; a majority voting scheme is then used for adjustment of the best-match rectangle. This way, it keeps the target at the center and tackles the problem of template drift, especially when tracking an airborne object such as an airplane, a kite, a bird, a helicopter, etc. Details of the BMRA algorithm can be found in [208]. After deciding the size of the template, it is matched in the search window using normalized correlation, and the spatial location of the peak correlation value is taken as the current position of the target in the search window. The matching is considered successful if the peak value of the normalized correlation is greater than a threshold. Normalized correlation is used as the similarity measure in the proposed tracking method
Algorithm 4.1 Correlation and Kalman filter tracking
Input: video sequence of n frames, template image of the target t, and target bounding rectangle in the 1st frame, r
Output: target position in each frame of the video sequence
for each of the n frames
1. Make the template t edge-enhanced
2. Extract the search window s
3. Make the search window s edge-enhanced
4. Match t with s using normalized correlation (NC)
5. cp ← max(NC)
6. Update the size of t
7. Handle occlusion using the Kalman filter
8. Update t
9. Output the bounding rectangle of t according to cp
end for
because it works relatively better for object localization in the case of edge-enhanced images [202]. The next step after matching is to update the template, as explained in Chapter 3. Algorithm 4.1 summarizes the correlation and KF based tracking methodology.
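The matching core of Algorithm 4.1 can be sketched as below. This is a minimal NumPy sketch, not the thesis implementation: the four-step E2 transform is reduced to a gradient-magnitude approximation (the Gaussian smoothing step is omitted), and the exhaustive normalized-correlation scan ignores the dynamic search window and scale handling.

```python
import numpy as np

def edge_enhance(img, thresh=0.1):
    """Rough stand-in for the E2 transform: gradient magnitude,
    normalized to [0, 1] and thresholded (smoothing omitted)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-12
    return np.where(mag > thresh, mag, 0.0)

def nc_match(template, search):
    """Normalized correlation of the template at every position of the
    search window; returns the peak value and its (row, col) offset."""
    th, tw = template.shape
    sh, sw = search.shape
    t = template / (np.linalg.norm(template) + 1e-12)
    best, pos = -1.0, (0, 0)
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            patch = search[r:r + th, c:c + tw]
            score = float((t * patch).sum()) / (np.linalg.norm(patch) + 1e-12)
            if score > best:
                best, pos = score, (r, c)
    return best, pos
```

In practice both the template and the search window would be edge-enhanced before `nc_match` is called, and the peak value `best` plays the role of cp in the updating and thresholding steps.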
4.2.2 Adaptive Threshold
The fixed-threshold method, used in many papers, e.g., [44], [191], [193], [36], sets a single value for all frames of a video and does not take into account any local information obtained from the correlation surface at each image frame. Therefore, it applies the same criterion at every frame regardless of the scene and target dynamics, and is thus highly likely to fail in the case of a fast maneuvering target changing its appearance heavily and rapidly. The peak correlation value provides clues about changes in the target; therefore, it may be used as heuristic information to introduce adaptability into the threshold. For example, if the current peak value of the normalized correlation is 0.85, this value may drop further in the next image frame, so the threshold should be set well below the current peak correlation value for upcoming frames. This way, the scheme uses local information about the target matching score to set the threshold at each frame, instead of using a global value for all frames. To avoid the possibility of a too-low threshold value accepting a poor match, a lower limit is put on the adaptive threshold. Mathematically, the process is described by Eq. (4.1).
\tau = \begin{cases} c_p - \psi & \text{if } \tau \ge \tau_l \\ \tau_l & \text{otherwise} \end{cases} \qquad (4.1)
Algorithm 4.2 Adaptive threshold
Input: current threshold τ, peak correlation value cp
Output: updated threshold
1. Initialize τl and ψ
2. if τ ≥ τl
3.     τ ← cp - ψ
4. else
5.     τ ← τl
6. end if
where 0.10 ≤ ψ ≤ 0.17 and 0 < τl < 1; i.e., it is assumed that the target may change its appearance by at most 17% in the next image frame. This limit was found empirically, and it works well for slow and fast maneuvering objects changing their appearance slowly or rapidly. The pseudo code of the adaptive threshold method is presented in Algorithm 4.2.
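Eq. (4.1) amounts to a one-line rule; a minimal sketch, with illustrative default values of ψ and τl chosen from the stated ranges:

```python
def adaptive_threshold(cp, tau, psi=0.15, tau_l=0.45):
    """Eq. (4.1): trail the current peak correlation by psi; once the
    threshold has fallen below the floor tau_l, reset it to tau_l."""
    return cp - psi if tau >= tau_l else tau_l
```

Starting from τ = 0.70 with cp = 0.85, the next threshold is 0.70; if cp then drops to 0.72, the threshold relaxes to 0.57, so a further moderate appearance change is still accepted as a match.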
4.3 Occlusion Handling with Kalman Filter
When the target is hidden, completely or partially, by another object in the scene, occlusion is said to have occurred. Handling this situation is a vital task for all visual object tracking algorithms. The peak correlation value may be used as an occlusion indicator, because it drops when the target suddenly gets occluded by another object. Once its value becomes less than the threshold, we stop updating the template and assume that the target coordinates provided by the correlation tracker are no longer trustworthy. The previously predicted Kalman filter position is taken as the current position of the target, and the Kalman filter is updated according to its own prediction. The value of the threshold is iteratively reduced; this is because changes in the target during occlusion are not incorporated into the template, so the peak correlation value may remain below the threshold. Moreover, the size of the dynamically created search window is enlarged in each iteration to take into account possible variations in the direction and speed of the target during occlusion. The template is correlated with the search window at each image frame, but the tracker remains in Kalman mode (i.e., the bounding box for the target is decided by the Kalman-predicted coordinates) until the best-match score exceeds the threshold. Algorithm 4.3 sums up these steps.
Algorithm 4.3 Occlusion handling with Kalman filter
1. Consider the previously predicted Kalman filter position as the current position of the target.
2. Update the Kalman filter according to its own prediction in the previous iteration.
3. Do not update the template during occlusion.
4. Iteratively reduce the value of the threshold.
5. Enlarge the size of the dynamically created search window at each iteration.
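The coasting behavior of steps 1-2 above can be sketched with a constant-velocity Kalman filter. This is a minimal sketch: the constant-velocity state model and the noise values are illustrative assumptions, not the thesis's filter design.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over state (x, y, vx, vy).
    During occlusion only predict() is called, so the track coasts along
    its last estimated velocity."""

    def __init__(self, x, y, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])      # state estimate
        self.P = np.eye(4)                       # state covariance
        self.F = np.array([[1., 0., 1., 0.],     # constant-velocity
                           [0., 1., 0., 1.],     # transition model
                           [0., 0., 1., 0.],
                           [0., 0., 0., 1.]])
        self.H = np.array([[1., 0., 0., 0.],     # we measure (x, y) only
                           [0., 1., 0., 0.]])
        self.Q = q * np.eye(4)                   # process noise
        self.R = r * np.eye(2)                   # measurement noise

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2].copy()                 # predicted position

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.s   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Feeding measurements that move one pixel per frame makes the filter learn that velocity; calling predict() alone afterwards continues the motion, which is the coasting used while the best-match score stays below the threshold.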
4.4 Adaptive Fast Mean Shift Algorithm
Mean shift is used for segmentation and tracking due to its clustering and mode-seeking capability. It is an iterative algorithm: starting from a point taken as the window center, it finds the weighted mean of the data in the window's neighborhood and shifts the center to the newly found mean position. The process terminates when the change in position is extremely small or the maximum number of iterations is reached. The mathematical detail of mean shift is simple, and it is easy to apply to images [29]. To find the weighted mean of the data points, a kernel function is used to assign a weight to each data point. In the case of a uniform kernel, an integral image can be computed for fast calculation of the mean shift [29]. The difference of two consecutive frames usually shows moving regions, which can be considered potential candidates for the target in a tracking scenario. The mean shift technique can be used to find these regions in different images. Beleznai et al. exploited the fast mean shift approach with a uniform kernel for human detection and tracking [25-29]. The same technique is adopted in this thesis, but novelty is introduced by making the size of the kernel adaptive at each frame. The size of the kernel is set equal to the size of the template, which is made adaptive by the following two methods: (1) correlating the original template as well as 10% smaller and 10% larger templates with the search space, where the size with the highest peak correlation value is taken as the new template size [17], [207], [191], [95]; and (2) the Best Match Rectangle Adjustment (BMRA) algorithm, which resizes the template according to the target size and keeps the target at the center of the template. BMRA divides the template into nine non-overlapping fragments and checks the energy content of each fragment. A voting scheme is used
Algorithm 4.4 Adaptive fast mean shift algorithm
1. Calculate the difference of the search windows.
2. Calculate the size of the template in the current image by BMRA as well as by correlating 10% larger and 10% smaller templates with the search window.
3. Set the size of the kernel to the template size calculated in step 2.
4. Apply the fast mean shift algorithm to the difference image from step 1 with the rectangular kernel from step 3.
for adjustment of the best-match rectangle [208]. Moreover, we compute the difference of search windows instead of full frames. The size of both search windows is kept the same, and their difference is obtained by subtracting the previous search window from the current one. In this way, too many moving regions and outliers in the difference image can be avoided. Furthermore, the process becomes more computationally efficient, because the mean shift is now calculated in the search window only. Algorithm 4.4 summarizes these steps.
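The uniform-kernel mean shift over a difference image can be sketched as follows. Window sums come from integral images, as in the fast formulation, and the kernel size is passed in per frame (in the proposed method it would equal the adaptive template size). This is a minimal sketch, not Beleznai et al.'s implementation:

```python
import numpy as np

def _integral(img):
    """Summed-area table with a zero top row and left column."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def _window_sum(ii, r0, r1, c0, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) from the integral image ii."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def fast_mean_shift(weights, center, ksize, max_iter=20, eps=0.5):
    """Shift a uniform rectangular kernel of size ksize=(kh, kw) to the
    local centroid of the weight map (e.g., a difference image)."""
    h, w = weights.shape
    ii = _integral(weights)
    ii_x = _integral(weights * np.arange(w)[None, :])
    ii_y = _integral(weights * np.arange(h)[:, None])
    cy, cx = float(center[0]), float(center[1])
    kh, kw = ksize
    for _ in range(max_iter):
        r0, r1 = max(0, int(cy) - kh // 2), min(h, int(cy) + kh // 2 + 1)
        c0, c1 = max(0, int(cx) - kw // 2), min(w, int(cx) + kw // 2 + 1)
        mass = _window_sum(ii, r0, r1, c0, c1)
        if mass <= 0:               # no motion evidence under the kernel
            break
        ny = _window_sum(ii_y, r0, r1, c0, c1) / mass
        nx = _window_sum(ii_x, r0, r1, c0, c1) / mass
        shift = (ny - cy) ** 2 + (nx - cx) ** 2
        cy, cx = ny, nx
        if shift < eps ** 2:        # converged to a mode
            break
    return cy, cx
```

Because each window sum is O(1) once the three integral images are built, the cost per iteration is independent of the kernel size, which is what makes the adaptive kernel affordable.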
4.5 Combining Correlation, Kalman Filter and Adaptive Kernel Fast Mean Shift Algorithms
The Kalman filter is a measurement-follower algorithm. It predicts the position of the target in the next frame based on its position (determined by the correlation tracker) in the current and previous frames. It works in a prediction-correction cycle, i.e., it predicts the next position of the target and corrects itself by exploiting the actual position of the target. During steady state, its accuracy is determined by the closeness of its predicted value to the measured value at each image frame. When the difference between the predicted and measured values grows larger than a threshold, it indicates an alarming situation for the tracking scenario. This may be due to either of the following reasons: (1) the correlation tracker provided a wrong measurement due to clutter,
Algorithm 4.5 Combining correlation, Kalman filter and adaptive fast mean shift algorithms
1. Calculate the difference between the measured and predicted target position at each image frame.
2. If the difference is greater than a threshold, get the difference search window by subtracting the previous search window from the current one.
3. Apply the proposed adaptive fast mean shift algorithm in the difference search window and find the position of the potential candidate for the target, i.e., the candidate with the highest correlation value with the template.
4. Check whether the position calculated in step 3 is the nearest neighbor of the measured value or the predicted value. If it is the measured value, consider it the correct position; otherwise, confidence is given to the predicted value.
5. The template is not updated.
6. The area of the search window is iteratively increased to avoid the possibility of the target getting out of the search window.
blurriness, occlusion, out-of-plane rotation of the target, or any other issue in the
search window; or (2) the target has suddenly changed its direction (e.g., the target may
be moving back and forth briskly); the correlation measurement is the correct one in this
case. The problem becomes worse when there is no significant decrease in the peak
correlation value, i.e., no indication of occlusion. In order to tackle this issue and to
Figure 4.1 Proposed Tracking Algorithm
decide whether to follow the Kalman filter prediction or the correlation tracker
measurement, an algorithm is proposed in this chapter, which combines the strengths
of the correlation, Kalman filter and adaptive kernel fast mean shift algorithms. For this,
the difference between the measured and the predicted target position in each image
frame is calculated. If the difference is greater than a threshold (which is the template
size), the difference of the current and the previous search windows is calculated, and
the adaptive fast mean shift algorithm is applied to the difference search window to find
the position of the potential candidate for the target. It is then checked whether the
measured or the predicted target position is closer to this value (i.e., the mean shift
calculated value). If it is the measured one, we consider it the correct position of the
target; otherwise, the predicted position is considered correct. Moreover, the template
Table 4.1 Description of dataset
Sequence # of frames Challenges involved
Faceocc2 812 Slowly occurring heavy occlusions, high appearance changes
ThreePastShop2Corr2 351 Similar objects, Heavy occlusion, appearance and scale changes
Woman 552 Occlusions, appearance changes
Car11 393 Low light conditions
David 462 Illumination changes, appearance changes
Singer 351 Illumination changes, scale changes
Board 698 3D motion, cluttered background
Box 1161 Fast 3D motion, occlusions, motion blur, cluttered background, scale changes
Liquor 1741 Fast 3D motion, occlusions, motion blur
Faceocc 887 Slow occurring long term occlusions, high appearance changes
Girl 502 360° out-of-plane rotation, appearance change, occlusion
is not updated in this case, and the area of the search window is increased iteratively so
that the possibility of the target going out of the search window may be avoided.
Algorithm 4.5 presents these steps briefly, and Figure 4.1 shows a flow chart of the
proposed tracking method.
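The switching logic described above (steps 1, 4, and 5 of Algorithm 4.5) can be sketched as follows; the helper name `arbitrate` and the exact tie-breaking rule are our own assumptions for illustration:

```python
import math

def arbitrate(measured, predicted, mean_shift_pos, threshold):
    """Decide between the correlation-measured and KF-predicted positions.
    Returns (position, update_template). When the two disagree by more than
    the threshold, confidence goes to whichever is the nearest neighbor of
    the mean-shift candidate, and the template update is frozen."""
    if math.dist(measured, predicted) <= threshold:
        return measured, True                      # trackers agree: normal update
    d_meas = math.dist(mean_shift_pos, measured)
    d_pred = math.dist(mean_shift_pos, predicted)
    chosen = measured if d_meas <= d_pred else predicted
    return chosen, False                           # conflict: do not update template
```

In the thesis the threshold is the template size, so the rule only fires when measurement and prediction no longer overlap.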
4.6 Results and Discussion

This section discusses: (1) the data set used to evaluate the proposed tracking algorithm,
(2) methods used for analysis of the tracking algorithm, (3) the effect of different
values of ψ for adaptive threshold on tracking results, (4) the comparison of the
proposed method with its base trackers, i.e., correlation tracker, and correlation and
KF tracker, and (5) the comparison of the proposed tracking strategy with nine state-
of-the-art tracking methods on different publicly available videos.
Table 4.2 Pascal score on test video sequences with different values of ψ

Sequence        ψ = 0.1   ψ = 0.12   ψ = 0.14   ψ = 0.16   ψ = 0.17
Faceocc2 1.00 1.00 1.00 1.00 1.00
Caviar 0.629 0.894 0.211 0.211 0.731
Woman 0.091 1.00 1.00 1.00 1.00
Car11 1.00 1.00 1.00 1.00 1.00
David 1.00 1.00 1.00 1.00 1.00
Singer 0.277 1.00 1.00 1.00 1.00
Board 0.312 0.75 0.77 0.78 0.78
Box 0.927 0.923 0.884 0.901 0.901
Liquor 0.831 0.954 0.736 0.711 0.562
Faceocc 0.983 0.933 0.865 0.865 0.865
Girl 0.663 0.891 0.743 0.743 0.743
4.6.1 Data Set

Eleven publicly available challenging videos have been used in different
experiments to show the robustness of the proposed algorithm. The videos are
Girl, Faceocc, Faceocc2, ThreePastShop2Cor2 (from the Caviar dataset), Woman,
Car11, David, Singer, Board, Box, and Liquor. Several researchers have used these videos
for benchmarking their algorithms in recent years [3, 4, 42, 53, 64, 65, 184, 209], so
the videos may be considered a de facto standard for tracking algorithm evaluation.
Girl, Faceocc, Faceocc2, and David videos can be downloaded from [197], Board,
Box, and Liquor videos are available at [210], and Woman, ThreePastShop2Cor2,
Singer, Car11 videos can be downloaded from [198, 211-213], respectively. Table 4.1
provides description of these videos.
Table 4.3 Mean distance error on test video sequences with different values of ψ

Sequence        ψ = 0.1   ψ = 0.12   ψ = 0.14   ψ = 0.16   ψ = 0.17
Faceocc2 9.463 9.463 9.463 9.463 9.463
Caviar 43.131 4.963 66.539 66.315 24.805
Woman 111.211 2.353 2.353 2.353 2.353
Car11 1.559 1.559 1.559 1.559 1.559
David 6.079 6.079 6.079 6.079 6.079
Singer 88.129 2.630 2.630 2.630 2.630
Board 75.571 34.960 33.524 33.125 33.125
Box 10.703 12.122 13.818 13.130 13.130
Liquor 35.566 20.469 62.426 63.640 73.473
Faceocc 6.357 11.066 17.321 17.321 17.321
Girl 40.236 21.428 25.017 25.017 25.017
4.6.2 Analysis of the Proposed Tracking Algorithm

The proposed algorithm is analyzed qualitatively as well as quantitatively. For
qualitative analysis, sample tracked frames of the proposed method are shown and
compared visually with the results of benchmark algorithms. The processed frames, in
Figure 4.2 Comparison of results for the simple correlation tracker, the correlation
and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker for the ThreePastShop2Cor2 video (from the Caviar dataset). It
supports the claim that adding the mean shift approach to the correlation and KF tracker (in the proposed way) improves the results.

Figure 4.3 Comparison of results for the simple correlation tracker,
the correlation and KF tracker, and the adaptive fast mean shift embedded with the correlation and KF tracker for the Liquor video. It supports the claim that adding the mean shift approach to the correlation and KF tracker (in the
proposed way) improves the results.
which the tracked rectangle is close to the target of interest, are considered visually
better results. A quantitative evaluation is performed to gain a better understanding of the
robustness of the proposed algorithm. For this, two measures have been used: one is
the mean distance from the center location, which gives the error between the center location
of the tracked rectangle and its ground truth value; the other is the Pascal VOC criterion
[214], which gives the number of correctly tracked frames. The Pascal score can be
computed using Eq. (4.2):
s = area(Rt ∩ Rg) / area(Rt ∪ Rg)        (4.2)
where Rt is the tracked rectangle and Rg is its ground truth. A frame is considered
correctly tracked if s > 0.5.
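Both evaluation measures can be implemented in a few lines (a minimal sketch; rectangles are assumed to be axis-aligned (x, y, w, h) tuples):

```python
def pascal_score(rt, rg):
    """Overlap (intersection over union) of two rectangles (x, y, w, h), Eq. (4.2)."""
    x1, y1 = max(rt[0], rg[0]), max(rt[1], rg[1])
    x2 = min(rt[0] + rt[2], rg[0] + rg[2])
    y2 = min(rt[1] + rt[3], rg[1] + rg[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union else 0.0

def correctly_tracked(rt, rg):
    """Pascal VOC criterion: a frame counts as correct if the overlap exceeds 0.5."""
    return pascal_score(rt, rg) > 0.5

def center_error(rt, rg):
    """Euclidean distance between the rectangle centers (mean distance measure)."""
    dx = (rt[0] + rt[2] / 2) - (rg[0] + rg[2] / 2)
    dy = (rt[1] + rt[3] / 2) - (rg[1] + rg[3] / 2)
    return (dx * dx + dy * dy) ** 0.5
```

The per-sequence numbers in Tables 4.2 and 4.3 are the mean of `pascal_score` exceedances and of `center_error` over all frames, respectively.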
4.6.3 Adaptive Threshold with Different Parameter Values

The value of ψ plays a pivotal role in choosing the adaptive threshold. Various experiments
with different values of ψ in the range of 0.10 to 0.17 have been performed to
calculate the Pascal score and mean distance error of all test video sequences; the
value of τl is set to 0.65. The results are summarized in Table 4.2 and Table 4.3, respectively.
It can be concluded from these results that ψ = 0.12 provides better results for most of
the test video sequences.
Figure 4.4 Comparison of Pascal score of correlation KF tracker with and without adaptive fast mean shift algorithm
4.6.4 Comparison of the Proposed Tracking Method with Its Constituents

In this section, three tracking algorithms are compared: (1) the simple correlation
tracker, (2) the correlation and KF based tracker, and (3) the proposed correlation, KF,
and adaptive fast mean shift based tracking algorithm. This way, the claim that
heuristically switching (with the help of the mean shift approach) between the correlation
based measured and the KF based predicted target coordinates makes the tracking robust
can be examined. The proposed adaptive threshold (discussed in Section 4.2) and
template updating methods (described in Chapter 3) have been used. Figure 4.2 and
Figure 4.3 show the center location error for ThreePastShop2Corr2 and Liquor
videos, respectively. It is clear from Figure 4.2 that the occlusion occurring during
frames 107 to 130 is not handled by the simple correlation tracker, which produces a mean
center error of 13.196. KF helps in this situation and reduces the average center
distance to 9.266, and the performance of the correlation-KF tracker improves significantly
when embedded with the adaptive fast mean shift approach, bringing the average center
distance down to 4.963. A similar situation can be seen in Figure 4.3 for the Liquor video, with
mean center errors of 57.239, 55.492, and 20.469 for these three approaches,
respectively. The occluded region is marked in Figures 4.2 and 4.3 by a downward
directed arrow with the label of occlusion. In order to elaborate the advantage of
integrating the adaptive fast mean shift approach into the correlation-KF tracker, the mean distance
error and Pascal score are calculated on all the test videos with and without the adaptive

Figure 4.5 Comparison of mean distance error of correlation KF tracker with and without adaptive fast mean shift algorithm
fast mean shift algorithm, as shown in Table 4.4. It is clear from the table that
the integration of the mean shift approach into the correlation-KF tracker significantly improves
the results. Figure 4.4 and Figure 4.5 summarize these results for the Pascal score and
mean distance error, respectively.
4.6.5 Performance Comparison of the Proposed Tracking Method with Other Methods

Tracking results of the proposed tracking method are compared with nine state-of-
the-art tracking methods, namely, incremental visual tracking (IVT) [14], the l1 tracker
[19], PN learning [215], visual tracking decomposition (VTD) [42], the MIL tracker [3],
FragTrack [4], the local sparse appearance model (LSAM) [64], PROST [53], and the EENC
tracker [44, 191]. The results of these trackers are quoted from the papers [53, 64].
Therefore, if the result on a certain video is not reported there, it is not mentioned here
(except for the EENC tracker, which was run on all the videos).
Table 4.4 Comparison of correlation KF tracker with and without adaptive fast mean shift algorithm
Sequence        Pascal VOC score                          Mean distance error
                Without mean shift   With mean shift      Without mean shift   With mean shift
Faceocc2 0.939 1.00 14.014 9.463
Caviar 0.477 0.894 9.266 4.963
Woman 1.000 1.00 2.353 2.353
Car11 1.000 1.00 1.608 1.559
David 0.645 1.00 15.735 6.079
Singer 1.000 1.00 3.035 2.630
Board 0.064 0.75 206.746 34.960
Box 0.335 0.923 213.631 12.122
Liquor 0.756 0.954 55.492 20.469
Faceocc 0.908 0.933 16.961 11.066
Girl 0.713 0.891 23.282 21.427
Table 4.5 summarizes the results of mean center location error in pixels and
Table 4.6 shows the mean Pascal score. First row of both tables shows the name of the
algorithm and its publication year. The best result for each video is shown in bold-
underline, the second best is in italic-underline and the third best result is in italic
format. The last row of each table shows the average score of the algorithms for all 9
videos. It is clear from the tables that the proposed algorithm, overall, performs better
than each of the other algorithms. The proposed tracking algorithm was implemented
using OpenCV on a Core i5 machine with 4 GB RAM. The number of frames
processed per second (fps) depends upon the size of the template and the search window.
The normalized correlation is calculated in the Fourier domain or the spatial domain,
depending on the sizes of the template and the search window, for fast processing. The
adaptive fast mean shift is also efficient as compared to the original mean shift
algorithm due to usage of the integral histogram technique. Furthermore, it is
calculated only when there is no overlap between predicted and measured target
Table 4.5 Mean center location error for video sequences of dataset
          IVT      L1       PN       VTD      MIL      FragTrack  LSAM     EENC      PROST    Proposed
          (2008)   (2009)   (2010)   (2010)   (2011)   (2006)     (2012)   (2008)    (2010)   method
Faceocc2 10.2 11.1 18.6 10.4 14.3 15.5 3.8 41.309 17.2 9.463
Caviar 66.2 65.9 53.0 60.9 83.9 94.2 2.3 91.867 -------- 4.963
Woman 167.5 131.6 9.0 136.6 122.4 113.6 2.8 104.549 -------- 2.353
Car11 2.1 33.3 25.1 27.1 43.5 63.9 2.0 2.332 -------- 1.559
David 3.6 7.6 9.7 13.6 15.6 46.0 3.6 17.418 15.3 6.079
Singer 8.5 4.6 32.7 4.1 15.2 22.0 4.8 15.589 -------- 2.630
Board 165.5 177.0 97.0 96.1 51.2 90.1 7.3 165.347 37.0 34.960
Box ------ 196.0 ------ ------ 104.6 57.4 -------- 117.866 12.696 12.122
Liquor ------ ------ ------ ------ 115.1 30.7 -------- 100.733 21.487 20.469
Faceocc ------ ------ ------ ------ 18.4 6.5 -------- 48.641 7.0 11.066
Girl 48.5 62.5 23.2 21.5 31.5 26.5 -------- 53.711 19.0 21.427
Average 59.013 76.622 33.537 46.287 55.973 51.491 3.8 69.033 18.526 11.554
coordinates, or the peak correlation value is less than the threshold. On average,
the whole algorithm runs in real time (i.e., 25 fps).
Figure 4.6, Figure 4.7, Figure 4.8, Figure 4.9, Figure 4.10, and Figure 4.11
graphically depict the center distance error and Pascal score for the Box, Board, and
Liquor videos for every fifth frame (as ground truth is available only for these frames).
The graphs show the results of the proposed, EENC, MIL, PROST, and FragTrack methods in
blue, red, green, black, and magenta, respectively. It is evident from these
figures that the proposed method outperforms all the other methods.
Figure 4.12 shows the performance of the proposed tracking method for the Box
video during occlusions (e.g., Frames 297 and 486), scale changes, and complex target
motion, including 3D rotation creating motion blur in a cluttered background (for
example, Frames 555 and 928).
Table 4.6 Pascal VOC score for video sequences of dataset
          IVT      L1       VTD      PN       MIL      FragTrack  LSAM     EENC      PROST    Proposed
          (2008)   (2009)   (2010)   (2010)   (2011)   (2006)     (2012)   (2008)    (2010)   method
Faceocc2 0.59 0.84 0.59 0.49 0.96 0.60 0.82 0.515 0.82 1.00
Caviar 0.21 0.20 0.19 0.21 0.19 0.19 0.84 0.309 ------- 0.894
Woman 0.19 0.18 0.15 0.60 0.16 0.20 0.78 0.182 ------- 1.00
Car11 0.81 0.44 0.43 0.38 0.17 0.09 0.81 0.886 ------- 1.00
David 0.72 0.63 0.53 0.60 0.70 0.47 0.79 0.742 0.80 1.00
Singer 0.66 0.70 0.79 0.41 0.33 0.34 0.74 0.246 ------- 1.00
Board 0.17 0.15 0.36 0.31 0.679 0.679 0.74 0.136 0.75 0.75
Box ------ 0.05 ------ ------ 0.245 0.614 ------- 0.506 0.914 0.923
Liquor ------ ------ ------ ------ 0.206 0.799 ------- 0.504 0.854 0.954
Faceocc ------ ------ ------ ------ 0.93 1.00 ------- 0.449 1.00 0.933
Girl 0.42 0.32 0.51 0.57 0.70 0.70 ------- 0.287 0.89 0.891
Average 0.471 0.390 0.444 0.446 0.479 0.516 0.789 0.433 0.861 0.940
Figure 4.14 shows a few tracked frames of the Liquor video sequence. The
proposed algorithm successfully tracks the target during occlusions (as shown in
Frames 360, 607, 776, 1115, 1183, 1236, 1319, 1355, 1438, and 1462) and 360°
rotation causing motion blur (e.g., Frames 1404 and 1407).
Figure 4.13 shows that the proposed algorithm successfully handles the out-
of-plane rotation of the target in the cluttered background of the Board video.

Figure 4.15 (Car11 video frames) shows that the proposed algorithm tracks
the target in low light conditions.

Figure 4.16 shows a few frames of the David video. The proposed algorithm
handles varying illumination conditions (e.g., Frames 1 and 25), complex target motion
(e.g., Frame 160), and target appearance changes (e.g., Frame 383).
Figure 4.17 depicts the results of the proposed algorithm on the Faceocc2 video.
The video contains large appearance changes (for example, Frames 19 and 577)
and slowly occurring heavy occlusions (e.g., more than 90% of the face is occluded as
Figure 4.6 Center distance error for Box video sequence
shown in Frame 720). The proposed template updating and tracking strategy keeps
a lock on the target successfully.
Figure 4.19 shows some frames of the Singer video. The proposed algorithm
successfully handles the strong illumination effects on the target (e.g., Frame 115) and
the large change in its scale (e.g., Frame 333).

Figure 4.18 shows some of the frames from the ThreePastShop2Cor2 video of
the Caviar dataset. The video contains similar objects, which makes it difficult to track
the target. The situation becomes worse due to the occlusions of other objects with the
target (e.g., Frames 83 and 120). The proposed method shows prominent results and
successfully tracks the target.
4.7 Chapter Summary

Correlation based methods have been in use since the very start of the visual tracking field
[56], [103], [138], [216], and they have shown their strength for long-term tracking sessions
[65], [191]. Classically, however, there are a few inherent issues with this approach:
(1) it is computation intensive, (2) it suffers from the template drift problem, and

Figure 4.7 Pascal Score for Box video sequence
(3) it may fail in the case of a fast maneuvering target, rapid changes in its appearance, or
occlusion and clutter in the scene. These issues are handled, to some extent, by
integrating KF with the correlation based tracking and temporarily updating the
template [202], [193]. Considering the position of the peak correlation value as the
position of the target in the current image frame, KF predicts its position in the upcoming
image frame. Thus, a relatively small search window can be determined where the
occurrence of the target is highly likely [44]. Moreover, KF gets the tracker out of
occlusion faced by the target. Occlusion is assumed to be happening if the correlation
value of the target in the search window falls below a threshold. Therefore, choosing the
right value of the threshold is very important. Many papers [190], [36], [191], [44],
[193] use a fixed threshold, but it does not work as the complexity of the scene changes. A
new method for an adaptive threshold based on the current frame's peak correlation value
is proposed in this chapter. During occlusion, the correlation based measurement vector is
ignored and the KF predicted vector is used as the next measurement vector. This way,
(1) the tracker becomes fast, (2) its performance remains safe from the clutter
outside the search window, and (3) it shows robustness to occlusion as well. However,
due to all or any of the above mentioned issues occurring inside the search window,
the tracker may provide wrong measurements to KF, which in turn generates wrong

Figure 4.8 Distance Score for Board video sequence
predictions, and the whole tracking process deteriorates. Now the question arises: how
can it be known automatically that this situation has happened? In order to answer this
question, the difference between the KF predicted and correlation based measured
coordinates is calculated and checked against another adaptive threshold based on
the target size. The next question that comes, intuitively, to mind is whether the
tracker should go with the predicted or the measured coordinates. The adaptive fast mean
shift algorithm is used to answer this question. It is applied to find the clusters in the
difference of the search windows of two consecutive frames. These clusters are
moving regions in the video, so they become potential candidates for being the
target. The nearest neighbor technique is used to check whether a candidate
target is close to the predicted or the measured coordinates. In this way, the fast mean shift
algorithm acts as an arbitrator between the KF and correlation based
results. Thus, KF can be protected from being misled by a wrong measurement
vector. The size of the kernel for mean shift is set adaptively according to the
changing target size. To tackle the issue of rapid change in target appearance, a novel
method is proposed which updates the target model according to the rate of appearance
change of the target. In general, the proposed tracking strategy can be considered as
an ensemble of the three techniques, complementing each other in complex situations.

Figure 4.9 Pascal Score for Board video sequence
The switching from one technique to another technique is decided heuristically as
described above.
Figure 4.11 Distance Score for Liquor video sequence
Figure 4.10 Pascal Score for Liquor video sequence
Frame 297 Frame 486 Frame 555 Frame 928
Figure 4.12 Sample tracked frames of Box video sequence. The proposed algorithm successfully tracks the target during occlusions, scale changes, 3D
motion causing blurriness, and clutter background.
Frame 360 Frame 607 Frame 776 Frame 1115
Frame 1183 Frame 1236 Frame 1319 Frame 1355
Frame 1438 Frame 1462 Frame 1504 Frame 1517
Figure 4.14 A few tracked frames of Liquor video sequence. The proposed approach successfully tracks during occlusions, 3D motion causing blurriness,
and background clutter.
Frame 20 Frame 156 Frame 545 Frame 599
Figure 4.13 Results for Board video sequence. The proposed algorithm successfully handles the out of plane motion of the target in cluttered
background.
Frame 19 Frame 130 Frame 172 Frame 267
Frame 421 Frame 492 Frame 577 Frame 720
Figure 4.17 A few frames of Faceocc2 video sequence. The proposed algorithm tracks the target with large appearance changes and slowly occurring heavy
occlusions.
Frame 1 Frame 25 Frame 200 Frame 305
Figure 4.15 Frames of Car video sequence. The proposed algorithm successfully tracks the target in low light conditions.
Frame 1 Frame 25 Frame 160 Frame 383
Figure 4.16 Some frames from David video sequence. The proposed algorithm tracks the target in varying illuminations and appearance changes.
Frame 1 Frame 115 Frame 136
Frame 240 Frame 265 Frame 333
Figure 4.19 A few frames of Singer video sequence. The proposed algorithm successfully handles high illumination effects as well as large scale changes.
Frame 1 Frame 83 Frame 120 Frame 317
Figure 4.18 Some tracked frames from the sequence ThreePastShop2Cor2 (Caviar dataset). The main challenges in the video include the existence of
similar objects, and the occlusions which occur while the persons in the sequence cross each other. The proposed method successfully tracks the
target.
5 Stabilized Active Camera Tracking System
An active camera tracking system (ACTS) tracks a target with a moving video
camera. The system is illustrated as part of the block diagram shown in Figure 5.1.
It consists of: (1) a video camera, (2) a visual tracking algorithm, (3) a pan-tilt control
algorithm, and (4) a pan-tilt unit (PTU). Every frame acquired from the video camera
is analyzed by the visual tracking algorithm, which localizes the target in the image in
pixel-coordinates. The coordinates are sent to the pan-tilt control algorithm which
rotates the PTU according to the motion of the object. Since the camera is attached to
the PTU, it also rotates in sync with the PTU. Thus, the tracked target is always
projected at the center of the video frames, regardless of whether the object is moving
or stationary.
If the ACTS is mounted on a vibrating platform such as a truck, helicopter, or ship,
it is required to stabilize the video without affecting the efficiency of the system.
The purpose of video stabilization is to filter out the annoying vibration from the
video to reduce the unnecessary strain on the eyes of the viewer.
A simplified block diagram of a stabilized ACTS is shown in Figure 5.1. The
visual object tracking (VOT) algorithm has been explained in Chapter 4, so an
introduction to the remaining algorithmic components of the system is provided below.
Figure 5.1 Simplified block diagram of the proposed stabilized active camera tracking system.
5.1 Pan-Tilt Control

The camera in the stabilized ACTS is mounted on top of a PTU, so it moves in sync
with it. The PTU motion is controlled by a control algorithm. If the control is not
smooth and precise, the object in the video will oscillate to-and-fro from the center of
the frame, and in the worst case the object may get out of the field of view (FOV).
One approach is to use a classic proportional-integral-derivative (PID) controller
[217]. However, its design requires a mathematical model of the system. Besides, it
necessitates a sensitive and rigorous tuning of its three gain parameters (i.e.,
proportional, differential and integral) at all the zoom levels of the camera. An
alternative approach is to use a fuzzy controller [218, 219], which does not require the
system model, but choosing the right set of membership functions and fuzzy rules
calibrated for every zoom level of the camera is practically very cumbersome.
Another alternative is to implement a neural network controller [220], but it is heavily
dependent on the quality and the variety of the examples in the training data set,
which can accurately represent the complete behavior of the controller in all possible
scenarios, including the varying zoom-levels of the camera. Moreover, the traditional
control algorithms, e.g. [221], are generally implemented based on the difference
between the center of the frame and the current target position in the frame. These
algorithms do not account for the target velocity. As a result, there will be oscillations
(if the object is moving slowly), lag (if it is moving at a moderate speed), and loss of
the object from the frame (if it is moving faster than the maximum pan-tilt velocity
generated by the control algorithm). Keeping in view the above-mentioned limitations
of the various control algorithms, a predictive open-loop car-following control (POL-
CFC) algorithm [44] is proposed for target tracking. Its basic idea is borrowed
from the car-following control (CFC) strategy [222]. The CFC assumes that the actual
velocity of the PTU is observable through a velocity sensor. However, the POL-CFC
does not make this assumption and simply considers that the current PTU velocity is
the previous velocity command sent to the PTU. Then, it computes the velocity of the
target relative to the PTU velocity from the predicted target positions provided by the
Kalman filter in the current and the next frame. Finally, it generates precise velocity
commands for the PTU to move the camera towards the target accurately in real-time.
Thus, the proposed control strategy is very useful for controlling a system that does
not feed back its current velocity, such as a stepper-motor PTU. Its performance is tested
on real-world scenarios and has proven to be adequately smooth, fast and accurate.
The POL-CFC algorithm in the proposed stabilized ACTS offers 0% overshoot, zero
steady-state tracking error, and a 1.7 second rise time for at least the 1x to 6x zoom levels
of the camera.
5.2 Video Stabilization

Video stabilization is the process of removing vibrations from a video. It has a very
wide application spectrum, ranging from consumer devices (e.g., handy-cams and mobile
phones with video cameras) to state-of-the-art military and defense systems, e.g.,
payloads for unmanned aerial vehicles (UAVs) and unmanned ground vehicles
(UGVs) [223]. There are many hardware as well as software solutions available
for video stabilization, each with its own merits and demerits depending upon the
application. There are two types of motion when the camera is mounted on a PTU.
One is the valid motion that comes from the motion of the object to be tracked. The
other is the annoying motion that comes from the mechanical vibration transmitted
by the vibrating vehicle (on which the PTU is mounted) or environmental factors
(such as wind). The aim of video stabilization is to filter out the latter motion [224,
225].
The ideal approach to video stabilization is a hardware solution. The use of
mechanical tools to physically prevent camera vibration is one such hardware
solution. Another is to exploit optical or electronic devices that influence
how the camera sensor receives the input light [226, 227]. These solutions are expensive
or need additional information about the camera motion. Therefore, image
processing based video stabilization (also called digital video stabilization) is the
approach of common choice [226].
An optical flow based method is adopted by Chand, Lie and Lu [228] for digital
video stabilization. However, it has an inherent aperture problem [229]. Fuzzy logic
modeling is used in [226] for video stabilization, but it is time consuming to select
the membership functions and tune their parameters to achieve satisfactory results.
An image based rendering technique is used in [230]. However, it works well only in the case
of slow camera motion. Block matching methods are used for stabilization [231-233],
but these algorithms do not track blocks in consecutive frames, so they may be misled
by large moving objects [226]. In order to handle these problems, a stabilization
algorithm which estimates the vibratory motion between the frames by taking inputs
from the visual tracking module (discussed in Chapter 4) is proposed. The proposed
stabilization method does not add any extra computational overhead to the system for
estimating the instantaneous vibration. The vibratory motion in the video is filtered
using a simple low-pass filter. Thus, the stabilization algorithm works at the full
frame rate of a standard video (i.e., 25 fps).
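As an illustration, the vibration estimate could be obtained with a first-order exponential low-pass filter over the target coordinates (the thesis specifies only "a simple low-pass filter", so the filter order, the parameter alpha, and the function name are assumptions):

```python
def stabilize_offsets(coords, alpha=0.9):
    """First-order exponential low-pass filter over a 1-D coordinate track.
    The smoothed sequence approximates the intentional motion; the residual
    (raw minus smoothed) is treated as vibration to be compensated."""
    smoothed, residuals = [], []
    s = coords[0]
    for c in coords:
        s = alpha * s + (1 - alpha) * c   # low-pass update
        smoothed.append(s)
        residuals.append(c - s)           # high-frequency component
    return smoothed, residuals
```

Each frame would then be translated by the negative of the residual in x and y to cancel the vibratory component while preserving the smoothed (valid) motion.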
5.3 Proposed Pan-Tilt Control Algorithm

The proposed pan-tilt control algorithm has been derived from the basic car-
following control (CFC) law [222]. The CFC law can be used only for a closed-loop
system in which the current velocity of the pan-tilt unit (PTU) is fed back to the
control algorithm. The CFC is modified here so that it can be used in an open-loop
system. The modified control algorithm is named the Predictive Open-Loop Car-
Following Control (POL-CFC) strategy. The algorithm generates the pan-tilt velocity
commands in accordance with the Kalman predicted [44] target velocity components
in the video frames. The use of the predicted velocity helps compensate for the
inertia of the pan-tilt mechanism and hence follow the target without any lag or
inaccuracy. The POL-CFC strategy is described mathematically as:
v_p[n+1] = v_p[n] + η ( K e*_x[n+1|n] + v*_rp[n+1|n] )
v_t[n+1] = v_t[n] + η ( v*_rt[n+1|n] + K e*_y[n+1|n] )        (5.1)
where v_p[n] and v_t[n] are the current pan and tilt velocities of the PTU (which
were generated by Eq. (5.1) in the previous iteration), η is a small positive constant in
the range (0.0, 1.0] which controls the amount of velocity added to the previous
velocity, K is the proportional gain parameter (the only parameter to be tuned
for every zoom level of the camera), and e*_x[n+1|n] and e*_y[n+1|n] are the predicted
errors in both axes, defined as:
$$
\begin{aligned}
e_x^{*}[n+1\,|\,n] &= r_x - x^{*}[n+1\,|\,n]\\
e_y^{*}[n+1\,|\,n] &= r_y - y^{*}[n+1\,|\,n]
\end{aligned}
\qquad (5.2)
$$
Furthermore, $v_{rp}^{*}[n+1\,|\,n]$ and $v_{rt}^{*}[n+1\,|\,n]$ in Eq. (5.1) are the predicted relative velocities of the target in terms of pan-tilt degrees per second, defined as:
$$
\begin{aligned}
v_{rp}^{*}[n+1\,|\,n] &= C_{dpp}\,\frac{x^{*}[n+1\,|\,n] - x^{*}[n\,|\,n-1]}{T}\\
v_{rt}^{*}[n+1\,|\,n] &= C_{dpp}\,\frac{y^{*}[n+1\,|\,n] - y^{*}[n\,|\,n-1]}{T}
\end{aligned}
\qquad (5.3)
$$
where $C_{dpp}$ is a conversion ratio in degrees per pixel, determined by a simple camera calibration procedure for every zoom level of the camera, $T$ is the sampling time (the inverse of the video frame rate), and $(x^{*}[n+1\,|\,n], y^{*}[n+1\,|\,n])$ and $(x^{*}[n\,|\,n-1], y^{*}[n\,|\,n-1])$ are the target coordinates in the video frame predicted by the Kalman filter in the current and the previous iterations, respectively. With POL-CFC, 0% overshoot, a 1.47-second rise time, and the maximum steady-state errors listed in Table 5.1 are achieved.
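The control law above can be condensed into a few lines. The following is a minimal illustrative sketch, not the thesis implementation; the function name and argument layout are assumptions, and the Kalman-predicted positions are simply passed in as inputs:

```python
import numpy as np

def pol_cfc_step(v_prev, pred, prev_pred, ref, K, eta, C_dpp, T):
    """One POL-CFC iteration (Eqs. 5.1-5.3).

    v_prev    : (v_p, v_t) velocity commands from the previous iteration
    pred      : Kalman-predicted target position (x*[n+1|n], y*[n+1|n])
    prev_pred : previous prediction (x*[n|n-1], y*[n|n-1])
    ref       : desired target position (r_x, r_y), e.g. the frame centre
    K, eta    : proportional gain and increment-control constant
    C_dpp, T  : degrees-per-pixel ratio and sampling time (1 / frame rate)
    """
    v_prev = np.asarray(v_prev, dtype=float)
    pred = np.asarray(pred, dtype=float)
    prev_pred = np.asarray(prev_pred, dtype=float)
    e = np.asarray(ref, dtype=float) - pred    # predicted position errors (Eq. 5.2)
    v_r = C_dpp * (pred - prev_pred) / T       # relative velocity in deg/s (Eq. 5.3)
    return v_prev + eta * (K * e + v_r)        # new velocity commands (Eq. 5.1)
```

When the target is predicted to sit at the reference point with no relative motion, the increment vanishes and the previous velocity command is simply repeated.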
5.4 Proposed Video Stabilization Algorithm

The proposed video stabilization algorithm takes two inputs: the current video frame
and the current image coordinates (x, y) of the target (estimated by the tracker
described in Chapter 4). The algorithm outputs the stabilized video frame, which can
be seen on a monitor, as illustrated in the block diagram shown in Figure 5.1.
The software approach to video stabilization (also called digital video stabilization) requires a foreground object with respect to which the stabilization process is performed. In the proposed algorithm, the target is taken as the foreground object. The inter-frame motion is estimated with the help of its image coordinates (x, y). This motion contains low-frequency components (i.e., valid motion) as well as high-frequency components (i.e., vibration). In order to filter out the latter, a low-pass filter is applied along the x and y axes. The proposed two-dimensional filter is given as:

Table 5.1 Maximum steady-state error of the tracker

Zoom          Maximum Steady-State Error (pixels)
1x to 6x      0
7x to 15x     ±1
16x to 19x    ±2
20x to 25x    ±3
$$
\begin{bmatrix} \hat{x}_n \\ \hat{y}_n \end{bmatrix}
= \begin{bmatrix} \alpha & 0 \\ 0 & \alpha \end{bmatrix}
\begin{bmatrix} x_n \\ y_n \end{bmatrix}
+ \begin{bmatrix} 1-\alpha & 0 \\ 0 & 1-\alpha \end{bmatrix}
\begin{bmatrix} \hat{x}_{n-1} \\ \hat{y}_{n-1} \end{bmatrix}
\qquad (5.4)
$$
where $x_n$ and $y_n$ are the coordinates of the target in the un-stabilized current frame estimated by the tracking module, $\hat{x}_n$ and $\hat{y}_n$ are the stabilized coordinates of the target in the current frame, $\hat{x}_{n-1}$ and $\hat{y}_{n-1}$ are the stabilized coordinates in the previous frame, and $\alpha$ is the filter coefficient, chosen in the range $0 < \alpha < 1$ to meet the filter stability criterion. The lower the value of $\alpha$, the lower the cutoff frequency of the low-pass filter (as illustrated in Figure 5.2). The cutoff frequency of a low-pass filter is defined as the frequency above which the magnitude of the frequency response of the filter is ideally zero. However, a practical low-pass filter response does not become zero immediately beyond the cutoff frequency, so the cutoff frequency is normally taken as the frequency at which the magnitude of the frequency response is $1/\sqrt{2}$ (i.e., 0.707). Thus, the value of $\alpha$ can be set according to the vibration frequency involved in the application at hand. For example, the frequency response of the filter when $\alpha$ is set to 0.11 is shown in Figure 5.3. This figure also shows that the cutoff frequency of the filter is 0.5 Hz (at the magnitude of 0.707). It may be observed that the frequency response around the cutoff frequency (i.e., the roll-off) is not perfectly steep because the filter is real (not ideal) and of first order (i.e., single pole). The filter has a single parameter $\alpha$, which is easy to tune at run-time while observing its effect on the stabilized video. However, if the optimal cutoff frequency for a specific application is known, a higher-order filter can be designed using the MATLAB filter design tool and implemented to have as small a transition region as possible around the cutoff frequency.

Figure 5.2 Relationship between α and cut-off frequency of the low-pass filter
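As a concrete check of the numbers quoted above, the recursion of Eq. (5.4) and its transfer function $H(z) = \alpha / (1 - (1-\alpha)z^{-1})$ can be evaluated numerically. The sketch below is illustrative (the function names are assumptions, not from the thesis); at α = 0.11 and a 25 fps sampling rate, the magnitude at 0.5 Hz comes out close to the 0.707 cutoff level:

```python
import numpy as np

def stabilize_coords(coords, alpha=0.11):
    """Apply the first-order low-pass filter of Eq. (5.4),
    x_hat[n] = alpha * x[n] + (1 - alpha) * x_hat[n-1],
    to a sequence of (x, y) target coordinates."""
    coords = np.asarray(coords, dtype=float)
    smoothed = np.empty_like(coords)
    smoothed[0] = coords[0]                  # initialise with the first sample
    for n in range(1, len(coords)):
        smoothed[n] = alpha * coords[n] + (1.0 - alpha) * smoothed[n - 1]
    return smoothed

def filter_magnitude(alpha, f_hz, fs=25.0):
    """|H(e^{jw})| of the single-pole filter at frequency f_hz for frame rate fs."""
    w = 2.0 * np.pi * f_hz / fs
    return alpha / abs(1.0 - (1.0 - alpha) * np.exp(-1j * w))
```

Evaluating `filter_magnitude(0.11, 0.5)` gives roughly 0.68, in good agreement with the 0.707 cutoff read off Figure 5.3.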
Once the stabilized xy-coordinates of the target in the current frame are
obtained, the vibratory motion estimation vector is calculated as below:
$$
\begin{bmatrix} M_{x_n} \\ M_{y_n} \end{bmatrix}
= \begin{bmatrix} x_n - \hat{x}_n \\ y_n - \hat{y}_n \end{bmatrix}
\qquad (5.5)
$$
Figure 5.3 Magnitude of frequency response of the low-pass filter at α = 0.11
where $M_{x_n}$ and $M_{y_n}$ are the components of the vibratory motion estimation vector for the current frame. The estimated vibratory motion is then compensated by translating every frame pixel at (i, j) in the direction opposite to the vibratory motion, such that the new coordinates of the pixel become (i′, j′), calculated as:
$$
\begin{bmatrix} i' \\ j' \end{bmatrix}
= \begin{bmatrix} i - M_{x_n} \\ j - M_{y_n} \end{bmatrix}
\qquad (5.6)
$$
It may be noted that i and j are the horizontal and vertical coordinates of the
pixel, respectively. If the new position of the pixel is obtained outside the frame
boundaries, the corresponding pixel is discarded. As a result, we get the stabilized
video frame with respect to the target at the cost of some undefined or vacant regions
Figure 5.4 Original (left side) versus stabilized (right side) frames of a video recorded from a vibrating flying helicopter (Frames 34 and 42 shown)
at the boundaries from where the pixels were translated. There are several strategies to fill these regions. One approach is to fill the vacant regions with black pixels, but this creates an unpleasant impression on the viewer because the number of black pixels changes continuously with the varying instantaneous magnitude of the vibration. Another approach is to fill the vacant regions with the corresponding regions of the previous frame, but this creates unpleasant artifacts at the boundaries of the video frame. Yet another approach is to define a border of fixed size greater than the anticipated maximum vibration amplitude and resize the stabilized image within the border up to the full frame; the stabilized video then displays only a slightly zoomed-in version of the valid scene with no black border. However, this approach slightly deteriorates the sharpness of the image due to the bilinear interpolation involved in resizing, and the interpolation adds extra computational overhead to the system. In order to overcome the above-mentioned limitations of these three approaches, a black border of fixed width greater than the anticipated maximum vibration amplitude, without resizing the stabilized portion of the image, is proposed in this thesis.
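The compensation step of Eq. (5.6), together with the fixed-width black border discussed above, can be sketched as follows. This is an illustrative NumPy version; the function name, the integer rounding of the motion vector, and the border width are assumptions:

```python
import numpy as np

def compensate(frame, mx, my, border=20):
    """Shift the frame by (-mx, -my) to cancel the estimated vibratory motion
    (Eq. 5.6). Pixels shifted outside the frame are discarded, and a black
    border of fixed width hides the varying vacant regions without resizing.
    mx, my are the components of the vibratory motion vector in pixels."""
    h, w = frame.shape[:2]
    out = np.zeros_like(frame)
    mx, my = int(round(mx)), int(round(my))
    # source rectangle in the original frame and its destination after the shift
    src_x0, src_x1 = max(0, mx), min(w, w + mx)
    src_y0, src_y1 = max(0, my), min(h, h + my)
    dst_x0, dst_y0 = src_x0 - mx, src_y0 - my
    out[dst_y0:dst_y0 + (src_y1 - src_y0), dst_x0:dst_x0 + (src_x1 - src_x0)] = \
        frame[src_y0:src_y1, src_x0:src_x1]
    # fixed-width black border, wider than the anticipated vibration amplitude
    out[:border, :] = 0; out[-border:, :] = 0
    out[:, :border] = 0; out[:, -border:] = 0
    return out
```

A pixel at (i, j) in the input ends up at (i − mx, j − my) in the output, exactly as in Eq. (5.6), with out-of-frame destinations silently dropped.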
Figure 5.5 Original versus stabilized x-coordinates of the left truck shown in Figure 5.4
5.5 Results and Discussion

This section presents the results of: (a) the video stabilization algorithm on some challenging off-line videos, (b) the active camera tracking system (ACTS), and (c) the stabilized ACTS.
5.5.1 Performance of Stabilization Algorithm

Figure 5.4 shows vibratory aerial video frames (left side) versus the corresponding stabilized frames (right side) produced by the proposed algorithm. The un-stabilized video was recorded from a flying helicopter while a truck was being tracked with the proposed tracking algorithm; it contains jitter caused by the helicopter vibration, which creates an unpleasant effect for the viewer. The large yellow crosshair overlaid on the video frames makes it easy to perceive the vibration in the un-stabilized frames and its effective attenuation in the stabilized frames. In order to visualize the effect of stabilization on the whole image sequence, a record of the x- and y-coordinates of the truck (on the left side of the road) in the original and the stabilized video frames is maintained, because the vibratory motion of the truck
Figure 5.6 Original versus stabilized y-coordinates of the left truck shown in Figure 5.4
can be considered as the vibratory motion of the whole scene. The coordinates are
shown in Figure 5.5 and Figure 5.6, where it may be noted that the coordinates of the
truck in the stabilized frames are smoother than those in the original vibratory video
frames. Another example to show the efficacy of the stabilization algorithm is given
Figure 5.7 Original (left side) versus stabilized (right side) frames of a video recorded from a vibrating hovering helicopter (Frames 20, 24, and 53 shown)
in Figure 5.7, which presents frames of a video of a building taken from a hovering helicopter. The helicopter vibrations yield the vibratory image frames shown in the left column of Figure 5.7; the right column shows the frames after stabilization. The plots in Figure 5.8 and Figure 5.9 illustrate the stabilization process.
5.5.2 Performance of Active Camera Tracking System

In this section, some experimental results of the active camera tracking system are
presented to show its robust and accurate performance.
Figure 5.10 shows some frames of a tracking session in which a helicopter is
being tracked. The best match area is represented by a white rectangle, and the frame
center (i.e., the optical axis of the camera) is represented by a white dot. The updated
edge-enhanced template is overlaid at the bottom-right of every frame. The overlaid
text at the top of the frames consists of the correlation peak value, the center
coordinates of the tracked target in the frame, zoom level of the camera, and finally
the pan-tilt velocities of the camera mounted on a PTU in degrees/second. The pan velocity is positive if the camera is rotating towards the left, and the tilt velocity is positive if the camera is rotating downwards. It can be seen that the helicopter is automatically
Figure 5.8 Original versus stabilized x-coordinates of the building shown in Figure 5.7
centralized in the video very efficiently and smoothly by increasing the pan velocity within the first 40 frames (i.e., 1.2 seconds), which is less than even the rise time of the proposed pan-tilt control system as mentioned in Section 5.1. After the initial automatic target centralization, the helicopter remains at the center of the frames throughout the tracking session, which can be verified from the target coordinates in the overlaid text, keeping in view that the frame size is 320×240. It may be noted that the helicopter is tracked persistently and precisely by the proposed active camera tracking system even when: (1) the user has initialized the template inaccurately due to the motion of the helicopter in the video, and (2) the size, appearance, and velocity of the helicopter are continuously varying. The BMR adjustment algorithm solves the incorrect initialization problem by resizing/relocating the BMR so that it tightly encloses the target within the first 20 frames. Later on, the BMR further resizes/relocates itself dynamically, according to the current size of the helicopter, through both the scale-handling method and the BMR adjustment algorithm.
Figure 5.11 illustrates how efficiently the proposed system tracks the face of a person who is walking in a room with all the lights turned off. The only light available in the room was coming from the blinds shown in the frames. This
Figure 5.9 Original versus stabilized y-coordinates of the building shown in Figure 5.7
natural light created a severe illumination variation in the video, since the camera was
operating in its auto mode. Specifically, when the camera was looking in the direction
of the bright window, the other things (persons, wall, etc.) became very dark (see
Frames 271 to 512), and when there was no bright window in the video frames, the
Figure 5.10 A helicopter is tracked persistently and precisely by the proposed tracking system even when the user has initialized the template inaccurately, and the size, appearance, and velocity of the helicopter are continuously varying (Frames 1, 20, 40, 300, 385, and 520 shown)
whole scene became a little clearer. It may also be noted that there is noise and little detail in the whole video due to the low-light conditions. The target person and the occluding person are both walking in the same direction, making the scenario even more complex. It can further be observed in Frame 495 that the occlusion of the tracked person by the other person happens partly in the bright region and partly in the dark region of the video frame. Moreover, the track of the target person after the occlusion is resumed in a very dark region, as shown in Frame 512. Since the persons were very near to the camera, even a small movement of the persons produced a large movement in the video frames. Thus, it was a challenging
Figure 5.11 Tracking the face of a person under severe illumination variation, noise, low detail, and occlusion. All the lights in the room were turned off in this experiment to create a challenging scenario. The dark yellow rectangle in Frame 495 indicates that the tracker is currently working in its occlusion-handling mode. (Frames 176, 271, 325, 481, 495, 512, 528, 540, and 569 shown)
experiment for the pan-tilt control algorithm as well. All the problems (i.e. severe
illumination variation, noise, low detail, full occlusion, and fast motion) are handled
very efficiently and robustly by the proposed active camera tracking system in real-
Figure 5.12 Results of un-stabilized (left column) vs. stabilized (right column) active camera tracking of a distant airplane (Frames 420, 422, 425, 427, and 430 shown)
time, and the face of the person of interest is always at (or near) the center of the
video frames.
5.5.3 Performance of Stabilized Active Camera Tracking System

In this section, the results of the complete stabilized active-camera tracking system are demonstrated.
Figure 5.12 shows some frames of a long active camera tracking and stabilization session of a very distant airplane at the 25x (highest) zoom level of the camera. The un-stabilized as well as the stabilized video frames were recorded in real-time for demonstration purposes. The left column depicts the resulting tracking video frames without stabilization, while the right column shows the frames with stabilization. A periodic vibratory force at the rate of 1 Hz was used to induce vibration on the PTU and thus in the real-time video. In the un-stabilized video frames, the visual tracking and control algorithms always try to keep the target at the center of the image plane, but due to the vibration the airplane oscillates about the frame center. The frames in the right column show that the stabilization module of the system successfully diminishes the vibration of the target being tracked.
In Figure 5.13, a pedestrian is being tracked in a cluttered environment at 11x zoom level. The vibratory source in this case is a Toyota 2400 cc Hilux engine running at 500 RPM. The synchronized tracking video frames without and with stabilization are again shown in the left and the right columns, respectively. The images on the left highlight the oscillatory motion of the scene, because the active camera tracking system is mounted on a vibrating vehicle. The images in the right column show that the tracked person and the varying background scene are stable and the vibration is significantly attenuated.
5.6 Chapter Summary

A robust stabilized active camera tracking system is proposed, consisting of a visual
tracking module, a pan-tilt control module, and a video stabilization module. The
visual tracking module can handle template-drift, noise, object fading (obscuration),
clutter, intermittent occlusion, varying illumination in the scene, high computational
complexity, and varying shapes, scale, and velocity of the maneuvering target during
its motion. The proposed pan-tilt control module is a predictive open-loop car-
following-control algorithm, which moves the camera efficiently and smoothly so that
the object being tracked is always at the center of the video frame. The control
algorithm offers 0% overshoot, negligible steady-state error, and 1.47 second rise-
time. The video stabilization module handles the annoying vibratory motion in the
Figure 5.13 Results of un-stabilized (left column) vs. stabilized (right column) active camera tracking of a pedestrian (Frames 196, 209, 226, 249, and 261 shown)
image frame during tracking while the system is mounted on a vibratory platform
(e.g., vehicle, helicopter, etc.). The complete proposed system has been successfully
used for more than two years in indoor as well as outdoor scenarios, and it works in
real-time at the full frame rate of 25 fps.
6 Conclusion and Future Work
Visual tracking is a non-trivial task in an unstructured environment, especially for a class-independent target. This thesis presents a new visual tracking framework which combines correlation, a Kalman filter, and the mean shift algorithm. The proposed tracking method successfully tracks a target (whose type is not known in advance) in an unknown environment. This chapter summarizes the thesis and provides future directions for VOT.
6.1 Summary

Chapter 1 provides an introduction to visual object tracking and shows its usability in other fields, e.g., human-computer interaction, security and surveillance systems, activity recognition, industrial robotics, etc. Moreover, the chapter explains different issues, such as template drift, changing target appearance, occlusion, clutter, and similar objects, which make VOT a non-trivial task.
Chapter 2 reviews various classical and contemporary approaches to VOT, covering old as well as recent techniques for visual tracking. Moreover, the chapter provides a list of online resources, including data sets and code for different tracking algorithms.
Chapter 3 explains the proposed template updating method. The updating method determines the rate of change of the target's appearance and sets the update rate of the template accordingly. In this way, the template is efficiently updated for slow- as well as fast-moving targets without drifting. The proposed updating method outperforms three other methods, i.e., the naïve, α, and β updating methods, in both qualitative and quantitative comparisons.
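The idea of an appearance-change-driven update rate can be sketched as follows. This is an illustrative formulation, not the exact rule from Chapter 3; the mapping from frame difference to update rate and the cap `max_rate` are assumptions:

```python
import numpy as np

def update_template(template, patch, max_rate=0.5):
    """Blend the current target patch into the template at a rate that grows
    with the measured appearance change (illustrative sketch)."""
    template = np.asarray(template, dtype=float)
    patch = np.asarray(patch, dtype=float)
    # normalised mean absolute difference as a crude appearance-change measure
    change = np.mean(np.abs(patch - template)) / 255.0
    rate = min(max_rate, change)        # faster appearance change -> faster update
    return (1.0 - rate) * template + rate * patch
```

A stable target leaves the template essentially untouched (suppressing drift), while a rapidly changing target pulls the template towards the newest patch.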
Chapter 4 proposes a tracking framework consisting of: (1) a correlation tracker, (2) a Kalman filter (KF), and (3) a mean shift tracker. These three methods work jointly to reinforce each other's strengths and to suppress their individual weaknesses. The correlation tracker normally suffers from the template drift problem, so the adaptive template updating method proposed in Chapter 3 is used. In order to handle occlusion, the KF is combined with the correlation tracker. The correlation threshold used to sense the occurrence of occlusion is set adaptively at each image frame based on the peak value in the previous frame. Moreover, a search area is defined around the KF-predicted position in the next frame to reduce the computation for correlation matching. The size of the search area is set dynamically according to the speed and direction of the target's motion. The KF-predicted and correlation-measured target positions do not coincide if the correlation-KF method fails to track the target while the peak correlation value does not drop below the threshold, due to the presence of a similar object in the background. For this case, an adaptive fast mean shift algorithm is proposed, which finds the position of the moving region, i.e., the candidate target, in the search window. The distance of this candidate from both the KF-predicted and the correlation-measured coordinates is calculated, and whichever is smaller determines the target position.
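The arbitration just described can be summarized compactly. This is a simplified sketch of the decision logic, not the thesis implementation; the function name and the agreement tolerance `agree_tol` are illustrative assumptions:

```python
import numpy as np

def arbitrate_position(kf_pred, corr_pos, corr_peak, threshold, ms_pos,
                       agree_tol=5.0):
    """Decide the final target position for one frame.

    kf_pred   : KF-predicted position (x, y)
    corr_pos  : correlation-measured position (x, y)
    corr_peak : current correlation peak value
    threshold : adaptive occlusion threshold (set from the previous frame's peak)
    ms_pos    : candidate position found by the fast mean-shift step
    """
    kf_pred, corr_pos, ms_pos = (np.asarray(p, dtype=float)
                                 for p in (kf_pred, corr_pos, ms_pos))
    if corr_peak < threshold:
        # Occlusion sensed: coast on the Kalman prediction.
        return kf_pred
    if np.linalg.norm(kf_pred - corr_pos) <= agree_tol:
        # Correlation and KF agree: trust the measurement.
        return corr_pos
    # Disagreement despite a high peak (similar object in the background):
    # take whichever estimate the mean-shift candidate is closer to.
    if np.linalg.norm(ms_pos - kf_pred) < np.linalg.norm(ms_pos - corr_pos):
        return kf_pred
    return corr_pos
```

The mean-shift candidate thus acts as a tie-breaker only when the correlation peak stays high while the two position estimates drift apart.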
The algorithm is compared with nine other recent methods on eleven challenging videos which pose different challenges such as occlusion, clutter, change of size, fast motion, and out-of-plane rotation. The experimental results, which include sampled tracked frames for the qualitative evaluation, and center location error and Pascal score for the quantitative evaluation, show that the proposed method tracks the target more robustly than the other methods.
Chapter 5 provides the details of the stabilized active camera tracking system. The active camera tracking system comprises a camera mounted on a pan-tilt unit (PTU). In order to smoothly track the target using the proposed tracking method, the PTU is moved using the car-following control algorithm. When the ACTS is mounted on a vibrating platform, the output video contains jitter and vibration. Stabilization methods normally require a reference object according to which the whole frame is stabilized. The proposed stabilization method uses the target position, calculated by the tracking method, and smooths out the vibrations using a single-pole low-pass filter, without any significant computational overhead. Experimental results show the efficacy of the algorithm.
6.2 Future Work

Although the proposed tracking framework shows robustness against different issues, including occlusion, clutter, changing target appearance, and fast motion, there is still much room for future work on the tracking method, described as follows:
If the target significantly changes its speed or direction during occlusion, it is likely that the KF will not be able to predict the target position correctly.
The assumption that the temporally updated template should not change its appearance by more than 50 percent compared with the initially selected template might not hold for a target moving continuously away from the camera.
The presence of other moving objects similar to the target in the search area reduces the robustness of the algorithm if the occlusion is not sensed.
A higher-order stabilization filter may remove more oscillations from the vibratory video, at the cost of tuning more parameters.
References
[1] S. Stalder and H. Grabner. (2009, 07 October). on-line boosting trackers. Available: http://www.vision.ee.ethz.ch/boostingTrackers/onlineBoosting.htm
[2] H. Grabner, M. Grabner, and H. Bischof, "Real-Time Tracking via On-line Boosting," in British Machine Vision Conference (BMVC), 2006, pp. 47-56.
[3] B. Babenko, M. H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 1619-1632, 2011.
[4] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 798-805.
[5] D. P. Chau, F. Bremond, and M. Thonnat, "Object Tracking in Videos: Approaches and Issues," The International Workshop'Rencontres UNS-UD'(RUNSUD), 2013.
[6] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1631-1643 2005.
[7] X. Zhang, W. Hu, S. Maybank, X. Li, and M. Zhu, "Sequential particle swarm optimization for visual tracking," in IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008., 2008, pp. 1-8
[8] M. Yang, Y. Wu, and G. Hua, "Context-aware visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 1195-1209, 2009.
[9] D. Comaniciu and V. Ramesh, "Mean Shift and Optimal Prediction for Efficient Object Tracking," in IEEE International Conference on Image Processing (ICIP), 2000, pp. 70–73.
[10] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000, pp. 142-149.
[11] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in IEEE International Conference on Computer Vision (ICCV), 2011, 2011, pp. 263-270.
[12] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," Acm Computing Surveys (CSUR), vol. 38, 2006.
[13] Y. Li and R. Nevatia, "Key Object Driven Multi-category Object Recognition, Localization and Tracking Using Spatio-temporal Context," in Europian Conference on Computer Vision, , 2008, pp. 409-422.
[14] D. A. Ross, J. Lim, R. S. Lin, and M. H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, pp. 125-141, 2008.
[15] D.-S. Jang and H.-I. Choi, "Active models for tracking moving objects," Pattern Recognition, vol. 33, pp. 1135-1146, 2000.
[16] X. Zhang, W. Hu, W. Qu, and S. Maybank, "Multiple object tracking via species-based particle swarm optimization," IEEE Transactions on Circuits and Systems for Video Technology vol. 20, pp. 1590-1602, 2010.
[17] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, 2003.
[18] S. Avidan, "Support vector tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1064-1072, 2004.
[19] X. Mei and H. Ling, "Robust visual tracking using ℓ 1 minimization," in IEEE 12th International Conference on Computer Vision, 2009, pp. 1436-1443.
[20] K. A. Joshi and D. G. Thakore, "A Survey on Moving Object Detection and Tracking in Video Surveillance System," International Journal of Soft Computing and Engineering (IJSCE) ISSN, pp. 2231-2307, 2012.
[21] Z. Li, C. Xu, and Y. Li, "Robust object tracking using mean shift and fast motion estimation," in IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS) , 2007, pp. 734-737
[22] H. T. Nguyen, Q. Ji, and A. W. M. Smeulders, "Spatio-temporal context for robust multitarget tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence vol. 29, pp. 52-64, 2007.
[23] R. Akbari, M. D. Jazi, and M. Palhang, "A hybrid method for robust multiple objects tracking in cluttered background," in Information and Communication Technologies, 2006. ICTTA'06. 2nd, 2006, pp. 1562-1567.
[24] C. Yang, R. Duraiswami, and L. Davis, "Efficient mean-shift tracking via a new similarity measure," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 176-183.
[25] C. Beleznai, B. Frühstück, and H. Bischop, "Human Tracking by Fast Mean Shift Mode Seeking," Trans. Journal of Multimedia (JMM), vol. 1, pp. 1-8, April 2006.
[26] C. Beleznai, B. Frühstück, and H. Bischop, "Human Tracking by Mode Seeking," in Proc. 4th International Symposium on Image and Signal Processing and Analysis (ISPA), 2005, pp. 1-6.
[27] C. Beleznai, B. Frühstück, and H. Bischop, "Tracking Multiple Humans using Fast Mean Shift Mode Seeking," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 25-32.
[28] C. Beleznai, B. Frühstück, and H. Bischop, "Detecting Humans in Groups using a Fast Mean Shift Procedure," in Proc. 28th Workshop of the Austrian Association for Pattern Recogniton (AAPR), 2004, pp. 71-78.
[29] C. Beleznai, B. Frühstück, and H. Bischop, "Human Detection in Groups using a Fast Mean Shift Procedure," in International Conference on Image Processing (ICIP), 2004, pp. 349-352.
[30] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, "Recent advances and trends in visual tracking: A review," Neurocomputing, vol. 74, pp. 3823-3831, 2011.
[31] B. Kwolek, "Multi-object Tracking Using Particle Swarm Optimization on Target Interactions," in Advances in Heuristic Signal Processing and Applications, ed: Springer, 2013, pp. 63-78
[32] X. Li, T. Zhang, X. Shen, and J. Sun, "Object Tracking using an Adaptive Kalman Filter combined with Mean Shift," Optical Engineering (OE) Letters, vol. 49(2), February 2010.
[33] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Li, "Robust Online Learned Spatio-Temporal Context Model for Visual Tracking," IEEE Transactions on Image Processing, 2013.
[34] Z. Zivkovic and B. Krose, "An EM-like algorithm for color-histogram-based object tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 798-803.
[35] K. Cannons, "A review of visual tracking," Dept. Comput. Sci. Eng., York Univ., Toronto, Canada, Tech. Rep. CSE-2008-07, 2008.
[36] A. Ali and S. M. Mirza, "Object tracking using correlation, Kalman filter and fast means shift algorithms," in International Conference on Emerging Technologies, 2006. ICET'06. , Islamabad, 2006, pp. 174-178.
[37] H. Grabner, J. Matas, L. Van Gool, and P. Cattin, "Tracking the invisible: Learning where the object might be," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1285-1292.
[38] L. Anton-Canalis, M. Hernandez-Tejera, and E. Sanchez-Nielsen, "Particle swarms as video sequence inhabitants for object tracking in computer vision," in Sixth International Conference on Intelligent Systems Design and Applications, 2006. ISDA'06. , 2006, pp. 604-609.
[39] H. Zhou, Y. Yuan, Y. Zhang, and C. Shi, "Non-rigid object tracking in complex scenes," Pattern Recognition Letters, vol. 30, pp. 98-102, 2009.
[40] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Computer Vision–ECCV, ed: Springer, 2008, pp. 234-247.
[41] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, "Survey of pedestrian detection for advanced driver assistance systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1239-1258, 2010.
[42] J. Kwon and K. M. Lee, "Visual tracking decomposition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1269-1276.
[43] Y. Zheng and Y. Meng, "Adaptive object tracking using particle swarm optimization," in International Symposium on Computational Intelligence in Robotics and Automation, 2007. CIRA 2007., 2007, pp. 43-48.
[44] J. Ahmed, M. N. Jafri, M. Shah, and M. Akbar, "Real-Time Edge-Enhanced Dynamic Correlation and Predictive Open-Loop Car Following Control for Robust Tracking," Machine Vision and Applications Journal, vol. 19, pp. 1-25, January 2008.
[45] K. Zhang and H. Song, "Real-time visual tracking via online weighted multiple instance learning," Pattern Recognition, 2012.
[46] N. Jifeng, L. Zhang, D. Zhang, and C. Wu, "Robust object tracking using joint color-texture histogram," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, pp. 1245-1263, 2009.
[47] N. A. Ogale, "A survey of techniques for human detection from video," Survey, University of Maryland, 2006.
[48] A. M. Abdel Tawab, M. B. Abdelhalim, and S.-D. Habib, "Efficient multi-feature PSO for fast gray level object-tracking," Applied Soft Computing, 2013.
[49] C. Ridder, O. Munkelt, and H. Kirchner, "Adaptive background estimation and foreground detection using kalman-filtering," in Proceedings of International Conference on recent Advances in Mechatronics, 1995, pp. 193-199.
[50] B. Zeisl, C. Leistner, A. Saffari, and H. Bischof, "On-line semi-supervised multiple-instance boosting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, 2010.
[51] E. Trucco and K. Plakas, "Video tracking: a concise survey," Oceanic Engineering, IEEE Journal of, vol. 31, pp. 520-529 2006.
[52] C. Shan, T. Tan, and Y. Wei, "Real-time Hand Tracking using a Mean Shift Embedded Particle Filter," Trans. Pattern Recognition, vol. 40, pp. 1958-1970, 2007.
[53] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof, "PROST: Parallel robust online simple tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 723-730.
[54] S. K. Borra and S. K. Chaparala, "Tracking of an Object in Video Stream Using a Hybrid PSO-FCM and Pattern Matching," International Journal of Engineering, vol. 2, 2013.
[55] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," in IEEE Nonrigid and Articulated Motion Workshop, 1997, pp. 90-102.
[56] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in 7th international joint conference on Artificial intelligence, 1981.
[57] X. Wang, L. Liu, and Z. Tang, "Infrared Human Tracking with Improved Mean Shift Algorithm based on Multi-cue Fusion," Applied Optics, vol. 48, pp. 4201-4212, July 2009.
[58] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu, "Crowd analysis: a survey," Machine Vision and Applications, vol. 19, pp. 345-357, 2008.
[59] M. Isard and A. Blake, "Condensation—conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, pp. 5-28, 1998.
[60] W. Kang and F. Deng, "Research on intelligent visual surveillance for public security," in 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007), 2007, pp. 824-829.
[61] J. Jeyakar, R. V. Babu, and K. R. Ramakrishnan, "Robust object tracking with background-weighted local kernels," Computer Vision and Image Understanding, vol. 112, pp. 296-309, 2008.
[62] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Computer Vision–ECCV 2012, ed: Springer, 2012, pp. 864-877.
[63] O. Arikan and L. Ikemoto, Computational Studies of Human Motion: Tracking and Motion Synthesis: Now Publishers Inc, 2006.
[64] X. Jia, H. Lu, and M. H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1822-1829.
[65] M. I. Khan, J. Ahmed, A. Ali, and A. Masood, "Robust Edge-Enhanced Fragment Based Normalized Correlation Tracking in Cluttered and Occluded Imagery," Signal Processing, Image Processing and Pattern Recognition, pp. 169-176, 2009.
[66] I. S. Kim, H. S. Choi, K. M. Yi, J. Y. Choi, and S. G. Kong, "Intelligent visual surveillance—A survey," International Journal of Control, Automation and Systems, vol. 8, pp. 926-939, 2010.
[67] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparsity-based collaborative model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1838-1845.
[68] T. B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, pp. 90-126, 2006.
[69] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, pp. 174-188, 2002.
[70] A. S. Jalal and V. Singh, "The State-of-the-Art in Visual Object Tracking," Informatica, vol. 36, pp. 227-248, 2012.
[71] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. v. d. Hengel, "A Survey of Appearance Models in Visual Object Tracking," ACM Transactions on Intelligent Systems and Technology, 2013.
[72] D.-N. Ta, W.-C. Chen, N. Gelfand, and K. Pulli, "SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2937-2944.
[73] I. Skrypnyk and D. G. Lowe, "Scene modelling, recognition and tracking with invariant image features," in Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2004), 2004, pp. 110-119.
[74] T. Ko, "A survey on behavior analysis in video surveillance for homeland security applications," in 37th IEEE Applied Imagery Pattern Recognition Workshop (AIPR'08), 2008, pp. 1-8.
[75] A. Ess, K. Schindler, B. Leibe, and L. Van Gool, "Object detection and tracking for autonomous navigation in dynamic environments," The International Journal of Robotics Research, vol. 29, pp. 1707-1725, 2010.
[76] P. Mistry and P. Maes, "SixthSense: a wearable gestural interface," in ACM SIGGRAPH ASIA 2009 Sketches, 2009, p. 11.
[77] G. R. Bradski, "Real time face and object tracking as a component of a perceptual user interface," in Fourth IEEE Workshop on Applications of Computer Vision (WACV'98), 1998, pp. 214-219.
[78] Z. Zhu and Q. Ji, "Eye gaze tracking under natural head movements," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 918-923.
[79] Siemens. (21-10-11). Sistore CX EDS. Available: https://www.cee.siemens.com/web/sk/sk/priemysel/technologie-budov/katalogove-listy/Katalogy_poziarnychPriemyselna_televizia/b299.pdf
[80] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa, "A system for video surveillance and monitoring: VSAM final report," Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, 2000.
[81] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 809-830, 2000.
[82] V. Kettnaker and R. Zabih, "Bayesian multi-camera surveillance," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999, pp. 1-18.
[83] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man and Cybernetics, vol. 34, pp. 334-352, August 2004.
[84] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade, "Algorithms for cooperative multisensor surveillance," Proceedings of the IEEE, vol. 89, pp. 1456-1477, 2001.
[85] M. Greiffenhagen, D. Comaniciu, H. Niemann, and V. Ramesh, "Design, analysis, and engineering of video monitoring systems: an approach and a case study," Proceedings of the IEEE, vol. 89, pp. 1498-1517, 2001.
[86] R. Kumar, H. Sawhney, S. Samarasekera, S. Hsu, H. Tao, Y. Guo, K. Hanna, A. Pope, R. Wildes, D. Hirvonen, M. Hansen, and P. Burt, "Aerial video surveillance and exploitation," Proceedings of the IEEE, vol. 89, pp. 1518-1539, 2001.
[87] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, "A real-time computer vision system for vehicle tracking and traffic surveillance," Transportation Research Part C: Emerging Technologies, vol. 6, pp. 271-288, 1998.
[88] J.-C. Tai, S.-T. Tseng, C.-P. Lin, and K.-T. Song, "Real-time image tracking for automatic traffic monitoring and enforcement applications," Image and Vision Computing, vol. 22, pp. 485-501, 2004.
[89] O. Masoud and N. P. Papanikolopoulos, "A novel method for tracking and counting pedestrians in real-time using a single camera," IEEE Transactions on Vehicular Technology, vol. 50, pp. 1267-1278, 2001.
[90] N. P. Papanikolopoulos and P. K. Khosla, "Adaptive robotic visual tracking: Theory and experiments," IEEE Transactions on Automatic Control, vol. 38, pp. 429-445, 1993.
[91] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: System overview and integration," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2002, pp. 2478-2483.
[92] I. F. Mondragon, P. Campoy, J. F. Correa, and L. Mejias, "Visual model feature tracking for UAV control," in IEEE International Symposium on Intelligent Signal Processing (WISP 2007), 2007, pp. 1-6.
[93] J. Lee, R. Huang, A. Vaughn, X. Xiao, J. K. Hedrick, M. Zennaro, and R. Sengupta, "Strategies of path-planning for a UAV to track a ground vehicle," in AINS Conference, 2003.
[94] U. Handmann, T. Kalinke, C. Tzomakas, M. Werner, and W. von Seelen, "Computer vision for driver assistance systems," in International Society for Optics and Photonics: Aerospace/Defense Sensing and Controls, 1998, pp. 136-147.
[95] J. Ahmed, M. Shah, A. Miller, D. Harper, and M. N. Jafri, "A Vision-based System for a UGV to Handle a Road Intersection," in Proceedings of the National Conference on Artificial Intelligence, 2007.
[96] D. Rand, R. Kizony, and P. T. L. Weiss, "The Sony PlayStation II EyeToy: low-cost virtual reality for use in rehabilitation," Journal of Neurologic Physical Therapy, vol. 32, pp. 155-163, 2008.
[97] S. Wang, X. Xiong, Y. Xu, C. Wang, W. Zhang, X. Dai, and D. Zhang, "Face-tracking as an augmented input in video games: enhancing presence, role-playing and control," in Proceedings of the ACM SIGCHI conference on Human Factors in computing systems, 2006, pp. 1097-1106.
[98] A. Amini, R. Owen, P. Anandan, and J. Duncan, "Non-rigid motion models for tracking the left-ventricular wall," in Information Processing in Medical Imaging, 1991, pp. 343-357.
[99] M. J. M. Vasconcelos, S. M. R. Ventura, D. R. S. Freitas, and J. M. R. S. Tavares, "Using statistical deformable models to reconstruct vocal tract shape from magnetic resonance images," Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine, vol. 224, pp. 1153-1163, 2010.
[100] M. J. Vasconcelos, S. M. Rua Ventura, D. R. S. Freitas, and J. M. R. S. Tavares, "Towards the automatic study of the vocal tract from magnetic resonance images," Journal of Voice, pp. 732-742, 2010.
[101] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 747-757, 2000.
[102] R. Bodor, B. Jackson, and N. Papanikolopoulos, "Vision-based human tracking and activity recognition," in Proc. of the 11th Mediterranean Conf. on Control and Automation, 2003.
[103] J. M. Fitts, "Precision correlation tracking via optimal weighting functions," in 18th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes, 1979, pp. 280-283.
[104] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Transactions on Information Theory, vol. 21, pp. 32-40, 1975.
[105] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 790-799, 1995.
[106] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 603-619, 2002.
[107] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: color image segmentation," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 750-755.
[108] A. O. Hero III, B. Ma, O. J. J. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, pp. 85-95, 2002.
[109] C. Shen, M. Brooks, and A. Van Den Hengel, "Fast global kernel density mode seeking: Applications to localization and tracking," IEEE Transactions on Image Processing, vol. 16, pp. 1457-1469, 2007.
[110] R. E. Kalman and R. S. Bucy, "New results in linear filtering and prediction theory," Journal of Basic Engineering, vol. 83, pp. 95-108, 1961.
[111] E. Brookner, Tracking and Kalman Filtering Made Easy: Wiley, New York, 1998.
[112] G. Welch and G. Bishop, "An introduction to the Kalman filter," TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2005.
[113] M. S. Grewal and A. P. Andrews, Kalman Filtering: Theory and Practice Using MATLAB: Wiley, 2011.
[114] Y. Boykov and D. P. Huttenlocher, "Adaptive Bayesian recognition in tracking rigid objects," in IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 697-704.
[115] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik, "A real-time computer vision system for measuring traffic parameters," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 495-501
[116] T. J. Broida and R. Chellappa, "Estimation of object motion parameters from noisy images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 90-99, 1986.
[117] D. B. Gennery, "Visual tracking of known three-dimensional objects," International Journal of Computer Vision, vol. 7, pp. 243-270, 1992.
[118] A. Blake and M. Isard, Active Contours: Springer-Verlag, 1998.
[119] D. Terzopoulos and R. Szeliski, "Tracking with Kalman snakes," in Active vision, 1993, pp. 3-20.
[120] E. V. Cuevas, D. Zaldivar, and R. Rojas, Kalman filter for vision tracking: Freie Univ., Fachbereich Mathematik und Informatik, 2005.
[121] N. Peterfreund, "Robust tracking of position and velocity with Kalman snakes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, pp. 564-569, 1999.
[122] B. D. O. Anderson and J. B. Moore, Optimal Filtering: Dover Publications, 2012.
[123] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and computing, vol. 10, pp. 197-208, 2000.
[124] G. M. Rao and C. Satyanarayana, "Visual Object Target Tracking Using Particle Filter: A Survey," International Journal of Image, Graphics and Signal Processing, pp. 57-71, 2013.
[125] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis: Wiley, New York, 1973.
[126] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed.: Prentice-Hall, Inc., 2002.
[127] C. Kuglin and D. Hines, "The Phase Correlation Image Alignment Method," in International Conference on Cybernetics and Society, 1975, pp. 163-165.
[128] J. P. Lewis, "Fast Normalized Cross-Correlation," in Vision Interface, 1995, pp. 120-123.
[129] S.-I. Chien and S.-H. Sung, "Adaptive window method with sizing vectors for reliable correlation-based target tracking," Pattern Recognition, vol. 33, pp. 237-249, 2000.
[130] R. Manduchi and G. A. Mian, "Accuracy analysis for correlation-based image registration algorithms," in IEEE International Symposium on Circuits and Systems (ISCAS'93), 1993, pp. 834-837.
[131] H. S. Stone, B. Tao, and M. McGuire, "Analysis of image registration noise due to rotationally dependent aliasing," Journal of Visual Communication and Image Representation, vol. 14, pp. 114-135, 2003.
[132] H. S. Stone, "Fourier-based image registration techniques," NEC Research, 2002.
[133] H. Foroosh, J. B. Zerubia, and M. Berthod, "Extension of phase correlation to subpixel registration," IEEE Transactions on Image Processing, vol. 11, pp. 188-200, 2002.
[134] Y. Keller, A. Averbuch, and O. Miller, "Robust Phase Correlation," in 17th International Conference on Pattern Recognition (ICPR’04), 2004, pp. 740-743.
[135] J. Ahmed and M. N. Jafri, "Improved Phase Correlation Matching," in ICISP-08: International Conference on Image and Signal Processing, France, 2008, pp. 128-135.
[136] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Boston: Artech House, 1999.
[137] M. Nixon and A. Aguado, Feature Extraction and Image Processing: Newnes, Oxford, 2002.
[138] M. Asgarizadeh and H. Pourghassem, "A robust object tracking synthetic structure using regional mutual information and edge correlation-based tracking algorithm in aerial surveillance application," Signal, Image and Video Processing, pp. 1-15, 2013.
[139] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 780-785, 1997.
[140] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using adaptive tracking to classify and monitor activities in a site," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, pp. 22-29.
[141] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999.
[142] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Video-Based Surveillance Systems, ed: Springer, 2002, pp. 135-144.
[143] T. Horprasert, D. Harwood, and L. S. Davis, "A robust background subtraction and shadow detection," in Asian Conference on Computer Vision, 2000, pp. 983-988.
[144] T. Horprasert, D. Harwood, and L. S. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," in International Conference on Computer Vision, 1999, pp. 1-19.
[145] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 831-843, 2000.
[146] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Fourth IEEE Workshop on Applications of Computer Vision (WACV'98), 1998, pp. 8-14.
[147] D. J. Dailey, F. W. Cathey, and S. Pumrin, "An algorithm to estimate mean traffic speed using uncalibrated cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 1, pp. 98-107, 2000.
[148] D. J. Dailey and L. Li, "An algorithm to estimate vehicle speed using uncalibrated cameras," in IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, 1999, pp. 441-446.
[149] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[150] M. J. Black and P. Anandan, "The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields," Computer Vision and Image Understanding, vol. 63, pp. 75-104, 1996.
[151] R. Szeliski and J. Coughlan, "Spline-based image registration," International Journal of Computer Vision, vol. 22, pp. 199-218, 1997.
[152] J. Shi and C. Tomasi, "Good features to track," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), 1994, pp. 593-600.
[153] K. Rangarajan and M. Shah, "Establishing motion correspondence," CVGIP: image understanding, vol. 54, pp. 56-73, 1991.
[154] C. P. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in IEEE Sixth International Conference on Computer Vision, 1998, pp. 555-562.
[155] D. Cremers and C. Schnörr, "Statistical shape knowledge in variational motion segmentation," Image and Vision Computing, vol. 21, pp. 77-86, 2003.
[156] B. Li, R. Chellappa, Q. Zheng, and S. Z. Der, "Model-based temporal object verification using video," IEEE Transactions on Image Processing, vol. 10, pp. 897-908, 2001.
[157] M. Bertalmío, G. Sapiro, and G. Randall, "Morphing active contours," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 733-737, 2000.
[158] A. R. Mansouri, "Region tracking via level set PDEs without motion computation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 947-961, 2002.
[159] X. Liu and T. Yu, "Gradient feature selection for online boosting," in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1-8.
[160] S. Avidan, "Ensemble tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 261-271, 2007.
[161] J. Wang, X. Chen, and W. Gao, "Online selecting discriminative tracking features using particle filter," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, 2005, pp. 1037-1042.
[162] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms: Wiley, 2004.
[163] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning: Springer, New York, 2006.
[164] A. Saffari, C. Leistner, M. Godec, and H. Bischof, "Robust multi-view boosting with priors," in Computer Vision–ECCV 2010, ed: Springer, 2010, pp. 776-789.
[165] C. Leistner, A. Saffari, P. M. Roth, and H. Bischof, "On robustness of on-line boosting - a competitive study," in IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009, pp. 1362-1369.
[166] H. Masnadi-Shirazi, V. Mahadevan, and N. Vasconcelos, "On the design of robust classifiers for computer vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 779-786.
[167] O. Williams, A. Blake, and R. Cipolla, "A sparse probabilistic learning algorithm for real-time tracking," in Ninth IEEE International Conference on Computer Vision, 2003. , 2003, pp. 353-360.
[168] J. Kennedy and R. Eberhart, "Particle Swarm Optimization," in IEEE International Conference on Neural Networks, Piscataway, 1995, pp. 1942-1948.
[169] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS'95), 1995, pp. 39-43.
[170] R. Poli, "Analysis of the publications on the applications of particle swarm optimisation," Journal of Artificial Evolution and Applications, 2008.
[171] M. Clerc and J. Kennedy, "The particle swarm - explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, pp. 58-73, 2002.
[172] M. P. Wachowiak, R. Smolíková, Y. Zheng, J. M. Zurada, and A. S. Elmaghraby, "An approach to multimodal biomedical image registration utilizing particle swarm optimization," IEEE Transactions on Evolutionary Computation, vol. 8, pp. 289-301, 2004.
[173] A. P. Engelbrecht, Computational Intelligence: An Introduction: Wiley, 2007.
[174] D. Sedighizadeh and E. Masehian, "Particle swarm optimization methods, taxonomy and applications," International Journal of Computer Theory and Engineering, vol. 1, pp. 486-502, 2009.
[175] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006.
[176] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on pure and applied mathematics, vol. 59, pp. 1207-1223, 2006.
[177] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proceedings of the IEEE, vol. 98, pp. 1031-1044, 2010.
[178] G. Sapiro, J. Mairal, J. Wright, Y. Ma, T. Huang, and S. Yan, "Sparse Representation for Computer Vision and Pattern Recognition," ed: Minnesota Univ Minneapolis Inst for Mathematics and Its Applications, 2009.
[179] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, pp. 2861-2873, 2010.
[180] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 210-227, 2009.
[181] X. Mei, "Visual Tracking and Illumination Recovery Via Sparse Representation," 2009.
[182] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 2259-2272, 2011.
[183] B. Liu, L. Yang, J. Huang, P. Meer, L. Gong, and C. Kulikowski, "Robust and fast collaborative tracking with two stage sparse optimization," in Computer Vision–ECCV 2010, ed: Springer, 2010, pp. 624-637.
[184] B. Liu, J. Huang, L. Yang, and C. Kulikowsk, "Robust tracking using local sparse appearance model and k-selection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1313-1320.
[185] S. Zhang, H. Yao, X. Sun, and X. Lu, "Sparse coding based visual tracking: Review and experimental comparison," Pattern Recognition, 2013.
[186] A. Oliva and A. Torralba, "The role of context in object recognition," Trends in cognitive sciences, vol. 11, pp. 520-527, 2007.
[187] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, "An empirical study of context in object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1271-1278.
[188] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Computer Vision–ECCV 2012, ed: Springer, 2012, pp. 702-715.
[189] Y. Wu, J. Lim, and M.-H. Yang, "Online Object Tracking: A Benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[190] S. Wong, "Advanced Correlation Tracking of Objects in Cluttered Imagery," in Defense and Security: International Society for Optics and Photonics, 2005, pp. 158-169.
[191] J. Ahmed, A. Ali, and A. Khan, "Stabilized active camera tracking system," Journal of Real-Time Image Processing, pp. 1-20, 2012.
[192] W. Wang, T. Adalı, and D. Emge, "A Novel Approach for Target Detection and Classification Using Canonical Correlation Analysis," Journal of Signal Processing Systems, pp. 1-12, 2012.
[193] A. Ali, H. Kauser, and M. I. Khan, "Automatic Visual Tracking and Firing System for Anti-Aircraft Machine Gun," in 6th International Bhurban Conference on Applied Sciences and Technology, Islamabad, Pakistan, 2009, pp. 253-257.
[194] F. Bousetouane, L. Dib, and H. Snoussi, "Improved mean shift integrating texture and color features for robust real time object tracking," The Visual Computer, pp. 1-16, 2013.
[195] M. Asgarizadeh, H. Pourghassem, G. Shahgholian, and H. Soleimani, "Robust and real time object tracking using regional mutual information in surveillance and reconnaissance systems," in 7th IEEE Iranian Machine Vision and Image Processing (MVIP) Conference, 2011, pp. 1-5.
[196] R. L. Brunson, D. L. Boesen, G. A. Crockett, and J. F. Riker, "Precision trackpoint control via correlation track referenced to simulated imagery," in International Society for Optics and Photonics: Aerospace Sensing, 1992, pp. 325-336.
[197] Available at: http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml.
[198] Available at: http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.html.
[199] J. N. Wilson and G. X. Ritter, Handbook of Computer Vision–Algorithms in Image Algebra: CRC Press, 2001.
[200] Q. Chen, M. Defrise, and F. Deconinck, "Symmetric phase-only matched filtering of Fourier-Mellin transforms for image registration and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, pp. 1156-1168, 1994.
[201] J. Jingying, H. Xiaodong, X. Kexin, and Y. Qilian, "Phase Correlation-based Matching Method with Sub-pixel Accuracy for Translated and Rotated Images," in IEEE International Conference on Signal Processing (ICSP'02), 2002, pp. 752-755.
[202] J. Ahmed, "Adaptive Edge-Enhanced Correlation Based Robust and Real-Time Visual Tracking Framework and Its Deployment in Machine Vision Systems," Ph.D. thesis, Department of Electrical Engineering, National University of Sciences and Technology (NUST), Rawalpindi, Pakistan, 2008.
[203] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB: Pearson Education Pte. Ltd., 2004.
[204] S. Sutor, R. Röhr, G. Pujolle, and R. Reda, "Efficient Mean Shift Clustering using Exponential Integral Kernels," International Journal of Electrical and Computer Engineering, vol. 4, pp. 206-210, 2009.
[205] A. Yilmaz, K. Shafique, N. Lobo, X. Li, T. Olson, and M. Shah, "Target tracking in FLIR imagery using mean shift and global motion compensation," in IEEE Workshop on Computer Vision Beyond Visible Spectrum, Kauai, Hawaii, 2001, pp. 54-58.
[206] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in IEEE Conf. on Computer Vision and Pattern Recognition, 2000, pp. 142-149.
[207] R. T. Collins, "Mean-shift blob tracking through scale space," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, pp. 234-240.
[208] J. Ahmed and M. N. Jafri, "Best-Match Rectangle Adjustment Algorithm for Persistent and Precise Correlation Tracking," in IEEE International Conference on Machine Vision (ICMV), Islamabad, Pakistan, 2007.
[209] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, "Locally Orderless Tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1940-1947.
[210] Available at : http://gpu4vision.icg.tugraz.at/index.php?content=subsites/prost/prost.php.
[211] Available at: http://groups.inf.ed.ac.uk/vision/caviar/caviardata1/.
[212] Available at: http://cv.snu.ac.kr/research/~vtd/.
[213] Available at: http://www.cs.toronto.edu/~dross/ivt/.
[214] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303-338, 2010.
[215] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 49-56.
[216] Y. Wang and Q. Zhao, "Robust object tracking via online Principal Component–Canonical Correlation Analysis (P3CA)," Signal, Image and Video Processing, pp. 1-16, 2013.
[217] B. C. Kuo and M. F. Golnaraghi, Automatic Control Systems: John Wiley & Sons, New York, 2003.
[218] E. V. Cuevas, D. Zaldivar, and R. Rojas, Intelligent tracking: Freie Univ., Fachbereich Mathematik und Informatik, 2003.
[219] T. J. Ross, Fuzzy logic with engineering applications: John Wiley & Sons, 2009.
[220] H. T. Nguyen and E. A. Walker, A first course in fuzzy logic: CRC press, 2005.
[221] N. Mir-Nasiri, "Camera-based 3D object tracking and following mobile robot," in IEEE Conference on Robotics, Automation and Mechatronics, 2006, pp. 1-6.
[222] "Basic Control Law for PTU to Follow a Moving Target," Application Note 01, Directed Perception Inc., 1996.
[223] E. Vermeulen, "Real-time video stabilization for moving platforms," in 21st Bristol UAV Systems Conference, 2007, p. 3.
[224] M. Tico and M. Vehvilainen, "Robust method of video stabilization," in EUSIPCO-07: European Signal and Image Processing Conference, 2007.
[225] R. Hu, R. Shi, I.-F. Shen, and W. Chen, "Video stabilization using scale-invariant features," in 11th International Conference on Information Visualization (IV'07), 2007, pp. 871-877.
[226] S. Battiato, G. Gallo, G. Puglisi, and S. Scellato, "Fuzzy-based motion estimation for video stabilization using SIFT interest points," in IS&T/SPIE Electronic Imaging, 2009, pp. 72500T-72500T-8.
[227] Canon Inc. (15-03-2014). "Canon FAQ: What is vari-angle prism?" Available: http://www.canon.com/bctv/faq/vari.html
[228] H.-C. Chang, S.-H. Lai, and K.-R. Lu, "A robust and efficient video stabilization algorithm," in IEEE International Conference on Multimedia and Expo (ICME'04), 2004, pp. 29-32.
[229] (15-03-2014). Available: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT12/node4.html.
[230] C. Buehler, M. Bosse, and L. McMillan, "Non-metric image-based rendering for video stabilization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, pp. II-609-II-614.
[231] S. Auberger and C. Miro, "Digital video stabilization architecture for low cost devices," in Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005), 2005, pp. 474-479.
[232] S.-W. Jang, M. Pomplun, G.-Y. Kim, and H.-I. Choi, "Adaptive robust estimation of affine parameters from block motion vectors," Image and Vision Computing, vol. 23, pp. 1250-1263, 2005.
[233] F. Vella, A. Castorina, M. Mancuso, and G. Messina, "Digital image stabilization by adaptive block motion vectors filtering," IEEE Transactions on Consumer Electronics, vol. 48, pp. 796-801, 2002.