
50 Years of Object Recognition: Directions Forward

Alexander Andreopoulos∗, John K. Tsotsos∗∗

∗ IBM Research - Almaden, 650 Harry Road, San Jose, California, 95120-6099. Tel.: +1-408-927-2947

∗∗ Department of Computer Science and Engineering, Centre for Vision Research, York University, Toronto, ON, M3J 1P3, Canada. Tel.: +1 416 736 2100 ext. 33257, Fax: +1 416 736 5872.

Abstract

Object recognition systems constitute a deeply entrenched and omnipresent component of modern intelligent systems. Research on object recognition algorithms has led to advances in factory and office automation through the creation of optical character recognition systems, assembly-line industrial inspection systems, as well as chip defect identification systems. It has also led to significant advances in medical imaging, defence and biometrics. In this paper we discuss the evolution of computer-based object recognition systems over the last fifty years, and overview the successes and failures of proposed solutions to the problem. We survey the breadth of approaches adopted over the years in attempting to solve the problem, and highlight the important role that active and attentive approaches must play in any solution that bridges the semantic gap in the proposed object representations, while simultaneously leading to efficient learning and inference algorithms. From the earliest systems which dealt with the character recognition problem, to modern visually-guided agents that can purposively search entire rooms for objects, we argue that a common thread of all such systems is their fragility and their inability to generalize as well as the human visual system can. At the same time, however, we demonstrate that the performance of such systems in strictly controlled environments often vastly outperforms the capabilities of the human visual system. We conclude our survey by arguing that the next step in the evolution of object recognition algorithms will require radical and bold steps forward in terms of the object representations, as well as the learning and inference algorithms used.

Keywords: Active Vision, Object Recognition, Object Representations, Object Learning, Dynamic Vision, Cognitive Vision Systems

1. Introduction

Artificial vision systems have fascinated humans since pre-historic times. The earliest mention of an artificial visually-guided agent appears in classical mythology, where a bronze giant named Talos was created by the ancient god Hephaestus and was given as a gift to King Minos of the Mediterranean island of Crete [1]. According to legend the robot served as a defender of the island from invaders by circling the island three times a day, while also making sure that the laws of the land were upheld by the island’s inhabitants.

The fascination and interest for vision systems continues today unabated, not only due to purely intellectual reasons related to basic research, but also due to the potential of such automated vision systems to drastically increase the productive capacity of organizations. Typically, the most essential component of a practical visually-guided agent is its object recognition module.

∗ Corresponding author. Email addresses: [email protected] (Alexander Andreopoulos), [email protected] (John K. Tsotsos)

Preprint submitted to Computer Vision and Image Understanding November 30, 2013


Table of Contents

1 Introduction

2 Classical Approaches
  2.1 Recognition Using Volumetric Parts
  2.2 Automatic Programming
  2.3 Perceptual Organization
  2.4 Interpretation Tree Search
  2.5 Geometric Invariants
  2.6 Qualitative 3-D Shape-Based Recognition and Deformable Models
  2.7 Function and Context
  2.8 Appearance Based Recognition
  2.9 Local Feature-Based Recognition and Constellation Methods
  2.10 Grammars and Related Graph Representations
  2.11 Some More Object Localization Algorithms
  2.12 Content-Based Image Retrieval

3 Active and Dynamic Vision
  3.1 Active Object Detection Literature Survey
  3.2 Active Object Localization and Recognition Literature Survey

4 Case Studies From Recognition Challenges and The Evolving Landscape
  4.1 Datasets and Evaluation Techniques
  4.2 Sampling the Current State-of-the-Art in the Recognition Literature
    4.2.1 Pascal 2005
    4.2.2 Pascal 2006
    4.2.3 Pascal 2007
    4.2.4 Pascal 2008
    4.2.5 Pascal 2009
    4.2.6 Pascal 2010
    4.2.7 Pascal 2011
  4.3 The Evolving Landscape

5 Conclusion


[Figure 1 diagram: an input image flows through feature extraction, feature grouping, object hypothesis generation and object verification; the object databases supply indexing primitives and candidate objects, and the output is the set of recognized objects.]

Figure 1: The components used in a typical object recognition system [21, 22].

Modern computer vision research has its origins in the early 1960s. The earliest applications were pattern recognition systems for character recognition in office automation related tasks [2, 3]. Early work by Roberts in the 1960s [4] first identified the need to match two-dimensional features extracted from images with the three-dimensional representations of objects. Subsequent research established the practical difficulties in reliably and consistently accomplishing such a task, especially as the scene complexity increased, as the illumination variability increased, and as time, cost, and sensor noise constraints became more prevalent.

Early systematic work on vision systems is also traced to the Hitachi labs in Japan where the term machine vision originated, to distinguish its more pragmatic goal of constructing practical applications [5], as compared to the more general term computer vision, popularly used to also include less pragmatic goals. An early research thrust in 1964 involved the automation of the wire-bonding process of transistors, with the ultimate goal of replacing human workers. Even though the automated system achieved 95% accuracy in lab tests, this was deemed too low to replace human workers. By 1973 however, fully automated assembly machines had been constructed [6], resulting in the world’s first image-based machine for the automatic assembly of semiconductor devices. Arguably, the most successful application of machine vision technologies is in the assembly and verification processes of the semiconductor industry, enabling the mass production and inspection of complex semiconductors such as wafers [5]. Due to the sheer complexity of this task, human workers could not have possibly solved such problems reliably and efficiently, thus demonstrating how vision technologies have directly contributed to many countries’ economic development by enabling the semiconductor revolution experienced over the last couple of decades.

Early recognition systems also appeared in biomedical research for the chromosome recognition task [7, 8]. Even though this work initially had limited impact, its importance became clearer later. Recognition technologies are also successfully used in the food industry (e.g., for the automated classification of agricultural products [9]), the electronics and machinery industry (for automated assembly and industrial inspection purposes [10]), and the pharmaceutical industry (for the classification of tablets and capsules) [5]. Many of the models used for representing objects are also effectively employed by the medical imaging community for the robust segmentation of anatomical structures such as the brain and the heart ventricles [11, 12]. Handwritten character recognition systems are also employed in mail sorting machines as well as for the digitization and automated indexing of documents [13, 14]. Furthermore, traffic monitoring and license plate recognition systems are also successfully used [15, 16], as are monetary bill recognition systems for use with ATMs [5]. Biometric vision-based systems for fingerprint recognition [17], iris pattern recognition [18], as well as finger-vein and palm-vein patterns [19, 20] have also gained acceptance by the law enforcement community and are widely used.

Despite the evident success of recognition systems that are tailored for specific tasks, robust solutions to the more general problem of recognizing complex object classes that are sensed under poorly controlled environments remain elusive. Furthermore, it is evident from the relevant literature on object recognition algorithms that there is no universal agreement on the definitions of various vision subtasks. Terms frequently encountered in the literature, such as detection, localization, recognition, understanding, classification, categorization, verification and identification, are often ill defined, leading to confusion and ambiguities. Vision is popularly defined as the process of discovering from images what is present in the world and where it is [23]. Within the context of this paper, we discern four levels of tasks in the vision problem [24]:


• Detection: is a particular item present in the stimulus?

• Localization: detection plus accurate location of item.

• Recognition: localization of all the items present in the stimulus.

• Understanding: recognition plus role of stimulus in the context of the scene.

The localization problem subsumes the detection problem by providing accurate location information of the a priori known item that is being queried for in the stimulus. The recognition problem denotes the more general problem of identifying all the objects present in the image and providing accurate location information of the respective objects. The understanding problem subsumes the recognition problem by adding the ability to decide the role of the stimulus within the context of the observed scene.

There also exist alternative approaches for classifying the various levels of the recognition problem. For example, Perona [25] discerns five levels of tasks of increasing difficulty in the recognition problem:

• Verification: Is a particular item present in an image patch?

• Detection and Localization: Given a complex image, decide if a particular exemplar object is located somewhere in this image, and provide accurate location information on this object.

• Classification: Given an image patch, decide which of the multiple possible categories are present in that patch.

• Naming: Given a large complex image (instead of an image patch as in the classification problem) determine the location and labels of the objects present in that image.

• Description: Given a complex image, name all the objects present in the image, and describe the actions and relationships of the various objects within the context of this image. As the author indicates, this is also sometimes referred to as scene understanding.

Within the context of this paper we will discern the detection, localization, recognition and understanding problems, as previously defined.
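To make the nesting of these four task levels concrete, the following toy sketch (in Python) treats a stimulus as a simple mapping from item labels to bounding boxes; the names Box, localize, detect and recognize are ours and purely illustrative, and are not an interface drawn from any of the surveyed systems.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float


# A toy "stimulus": a mapping from item labels to their (ground-truth) boxes.
Stimulus = Dict[str, Box]


def localize(stimulus: Stimulus, query: str) -> Optional[Box]:
    """Localization: detection of an a priori known query item plus its location."""
    return stimulus.get(query)


def detect(stimulus: Stimulus, query: str) -> bool:
    """Detection is subsumed by localization: the item is present iff it can be localized."""
    return localize(stimulus, query) is not None


def recognize(stimulus: Stimulus, vocabulary: List[str]) -> Dict[str, Box]:
    """Recognition: localize all the (known) items present in the stimulus."""
    return {label: box for label in vocabulary
            if (box := localize(stimulus, label)) is not None}


scene = {"cup": Box(10, 20, 40, 60), "book": Box(100, 50, 80, 30)}
print(detect(scene, "cup"))                        # True
print(recognize(scene, ["cup", "book", "phone"]))  # boxes for "cup" and "book" only
```

Understanding has no comparably simple signature, since it additionally requires reasoning about the role of the stimulus within the scene.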

For relatively small object database sizes with small inter-object similarity, the problem of exemplar based object detection in unoccluded scenes, and under controlled illumination and sensing conditions, is considered solved by the majority of the computer vision community. Great strides have also been made towards solving the localization problem. Problems such as occlusion and variable lighting conditions still make the detection, localization and recognition problems a challenge. Tsotsos [21] and Dickinson [22] present the components used in a typical object recognition system: that is, feature extraction, followed by feature grouping, followed by object hypothesis generation, followed by an object verification stage (see Fig. 1). The advent and growing popularity of machine learning approaches and bag-of-features style approaches has somewhat blurred the distinction between the above mentioned components. It is not uncommon today to come across popular recognition approaches which consist of a single feature extraction phase, followed by the application of cascades of one or more powerful classifiers.
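The standard pipeline of Fig. 1 can also be written down as a chain of interchangeable stages. The skeleton below is a minimal sketch under our own naming (the stage type aliases and recognition_pipeline are hypothetical, not an API from the literature); in bag-of-features style systems the last three stages often collapse into a single classifier, as noted above.

```python
from typing import Callable, List, Sequence

# Placeholder stage signatures for the pipeline of Fig. 1; real systems differ widely
# in what each stage actually computes.
ExtractFeatures = Callable[[object], List[object]]                       # image -> image features
GroupFeatures = Callable[[List[object]], List[object]]                   # features -> indexing primitives
HypothesizeObjects = Callable[[List[object], Sequence[str]], List[str]]  # primitives + database -> candidates
VerifyObjects = Callable[[object, List[str]], List[str]]                 # image + candidates -> recognized objects


def recognition_pipeline(image: object,
                         database: Sequence[str],
                         extract: ExtractFeatures,
                         group: GroupFeatures,
                         hypothesize: HypothesizeObjects,
                         verify: VerifyObjects) -> List[str]:
    features = extract(image)                        # feature extraction
    primitives = group(features)                     # feature grouping
    candidates = hypothesize(primitives, database)   # object hypotheses from the database
    return verify(image, candidates)                 # object verification
```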

The definition of an object is somewhat ambiguous and task dependent, since it can change depending on whether we are dealing with the detection, localization, recognition or understanding problem. According to one definition [26], the simpler the problem is (i.e., the further away we are from the image understanding problem as defined above), the closer the definition of an object is to that of a set of templates defining the features that the object must possess under all viewpoints and conditions under which it can be sensed. As we begin dealing with more abstract problems (such as the object understanding problem) the definition of an object becomes more nebulous and dependent on contextual knowledge, since it depends less on the existence of a finite set of feature templates. For example, the object class of toys is significantly abstract and depends on the context. See the work of Edelman [27] for a characterization of what might constitute a proper definition of an object. It is important to emphasize that there were multiple starting points that one can identify for early definitions of what constitutes an object, since this is highly dependent on the recognition system used. As previously discussed, one early starting point was work on the block-world system which led to definitions and generalizations involving 3D objects. However, there were also other significantly different early definitions, which emerged from early applications on character and chromosome recognition and the analysis of aerial images. These latter applications led to progress in pattern recognition, feature detection and segmentation but dealt with objects of a different type. These latter approaches are closely related to modern 2D appearance based object recognition research. Arguably, one of the ultimate goals of recognition research is to identify a common inference, learning and representational framework for objects that is not application-domain specific. In Sec. 4.3 we discuss how insights from neuroscience might influence the community in the search for such a framework.

[Figure 2 diagram: the spectrum ranges over selection of objects, events and tasks; selection of the world model; head/body movements and eye movements (selection of the visual field and induced effects); an inhibitory beam; adaptation; selection of the visual field for detailed analysis; selection of spatial and feature dimensions of interest within the visual field; and selection of operating parameters for selected units.]

Figure 2: The spectrum of attentional mechanisms, as proposed by Tsotsos [28]. Notice that within this framework, the active vision problem is subsumed by the attention problem. An active vision system is typically characterized by purposive eye, head or body movements that result in the selection of the visual field, while an attention system is characterized by the full set of mechanisms and behaviours identified above.

Early research in computer vision was tightly coupled to the general AI problem, as is evidenced by the large overlap in the publication outlets used by the two communities up until the late 1980s. Subsequent approaches to vision research shifted to more mathematically oriented approaches that were significantly different from the classical AI techniques of the time. This spurred a greater differentiation between the two communities. This differentiation between the communities is somewhat unfortunate, especially for dealing with the understanding problem defined above, due to the evident need for high level reasoning capabilities in order to deal with certain vision problem instances which are intractable if one attempts to solve them by extracting a classical object representation from a scene. The need for reasoning capabilities in artificial vision systems is further supported by experiments demonstrating that the human visual system is highly biased by top-down contextual knowledge (an executive controller), which can have a drastic effect on how our visual system perceives the world [29]. More recently a new research thrust is evident, in particular on the part of EU, Japanese, and South Korean funding agencies, towards supporting the creation of more pragmatic object recognition systems that are tightly coupled with cognitive robotics systems. This is evidenced by the fact that during the last decade, around one billion euros have been invested by EU-related funding agencies alone, towards supporting research in cognitive robotics. This last point is further evidenced by the more recent announcement in the U.S. of the National Robotics Initiative for the development of robots that work alongside humans in support of individuals or groups. Assuming that current trends continue, it is fair to predict that research-wise, the future for vision-based robotics systems looks bright.

The passive approach to vision refers to system architectures which exhibit virtually no control over the data acquisition process, and thus play a minor role in improving the vision system’s performance. The passive approach has dominated the computer vision literature, partly due to the influential bottom-up approach to vision advocated by Marr [23], but also partly due to a number of difficulties with implementing non passive approaches to vision, which are elaborated upon later in this paper. Support for a passive approach to the vision problem is evident even in one of the earliest known treatises on vision [30], where vision is described as a passive process that is mediated by what is referred to as the “transparent” (διαϕανες), an invisible property of nature that allows the sense organ to take the form of the visible object.

In contrast, approaches which exhibit a non trivial degree of intelligent decision making in the image and data acquisition process are referred to as “active” approaches [31, 32]. Active approaches offer a set of different techniques for solving similar sets of vision problems. Active approaches are motivated by the fact that the human visual system has two main characteristics: the eyes can move and visual sensitivity is highly heterogeneous across visual space [33]. As we will discuss later in this manuscript, active approaches are most useful in vision applications where the issue of mobility and power efficiency becomes a significant factor for determining the viability of the constructed vision system. We can classify these approaches as “limited active” approaches, which control a single parameter (such as the focus), and “fully active” approaches, which control more than one parameter within their full range of possibilities.

From the late 1990s and for the next decade, interest in active vision research by the computer vision community underwent somewhat of a hiatus. However, the recent funding surge in cognitive vision systems and vision based robotics research has reinvigorated research on active approaches to the recognition problem. Historically, early related work is traced to Brentano [34], who introduced a theory that became known as act psychology. This represents the earliest known discussion on the possibility that a subject’s actions might play an important role in perception. Barrow and Popplestone [35] presented what is widely considered the first (albeit limited) discussion on the relevance of object representations and active perception in computer vision. Garvey [36] also presented an early discussion on the benefits of purposive approaches in vision. Gibson [37] critiques the passive approach to vision, and argues that a visual system should also serve as a mediator in order to direct action and determine when to move an eye in one direction instead of another direction. Such early research, followed by a number of influential papers on object representations [38, 39, 40, 41], sparked the titillating and still relatively unexplored question of how task-directed actions can affect the construction of optimal (in terms of their encoding length) and robust object representations. The concept of active perception was popularized by Bajcsy [42, 31], as “a problem of intelligent control strategies applied to the data acquisition process”. The use of the term active vision was also popularized by Aloimonos et al. [32], where it was shown that a number of problems that are ill-posed for a passive observer are simplified when addressed by an active observer. Ballard [43] further popularized the idea that a serial component is necessary in a vision system. Tsotsos [28] proposed that the active vision problem is a special case of the attention problem, which is generally acknowledged to play a fundamental role in the human visual system (see Fig. 2). Tsotsos [29] also presented a relevant literature survey on the role of attention in human vision.

More recent efforts at formalizing the problem and motivating the need for active perception are discussed in [44, 45, 26]. In [26] for example, the recognition problem is cast within the framework of Probably-Approximately-Correct learning (PAC learning [46, 47, 48]). This formalization enables the authors to prove the existence of approximate solutions to the recognition problem, under various models of uncertainty. In other words, given a “good” object detector the authors provide a set of sufficient conditions such that for all 0 < ε, δ < 1/2, with confidence at least 1 − δ we can efficiently localize the positions of all the objects we are searching for with an error of at most ε (see Fig. 3). Another important problem addressed is that of determining a set of constraints on the learned object’s fidelity, which guarantee that if we fail to learn a representation for the target object “quickly enough” it was not due to system noise, due to an insufficient number of training examples, or due to the use of an over-expressive or over-constrained set of object representations H during the object learning/training phase.
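For intuition only, the flavour of such a guarantee can be conveyed with the textbook PAC bound for a finite hypothesis class; this generic bound is illustrative and should not be confused with the specific sufficient conditions derived in [26].

```latex
% Standard realizable-case PAC guarantee for a finite hypothesis class H (illustrative only;
% not the formalization of [26]).
\[
  m \;\geq\; \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)
  \quad\Longrightarrow\quad
  \Pr\!\big[\,\mathrm{err}(h) \le \varepsilon\,\big] \;\geq\; 1-\delta ,
\]
% for every hypothesis h in H that is consistent with the m training examples,
% with 0 < epsilon, delta < 1/2.
```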

From a less formal perspective, active control of a vision sensor offers a number of benefits [28, 24]. It enables us to: (i) Bring into the sensor’s field of view regions that are hidden due to occlusion and self-occlusion. (ii) Foveate and compensate for spatial non-uniformity of the sensor. (iii) Increase spatial resolution through sensor zoom and observer motion that brings the region of interest in the depth of field of the camera. (iv) Disambiguate degenerate views due to finite camera resolution, lighting changes and induced motion [49]. (v) Compensate for incomplete information and complete a task.

An active vision system’s benefits must outweigh the associated execution costs [28, 33, 24]. The associated costs of an active vision system include: (i) Deciding the actions to perform and their execution order. (ii) The time to execute the commands and bring the actuators to their desired state. (iii) Adapting the system to the new viewpoint, finding the correspondences between the old and new viewpoint, and compensating for the inevitable ambiguities due to sensor noise. By modeling the costs in a way that improves the efficiency of a task solution, a significant benefit could be achieved. For example, the cost could include the distance associated with various paths that a sensor could follow in moving from point A to point B, and the task could involve searching for an object in a certain search region. In such a case, the cost can help us locate the item of interest quickly, by minimizing the distance covered while searching for the object.

Figure 3: A non-rigorous overview of the assumptions under which the object localization and recognition problems (as formalized in [26]) are well-behaved and efficiently solvable/learnable problems. Notice that within this framework the recognition problem (bottom right box) subsumes the localization problem (top right box) in that any “good” solution to the localization problem could also be used to solve the recognition problem (although not necessarily optimally), and vice-versa.
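As a deliberately simple illustration of folding such movement costs into the search strategy, the sketch below greedily orders candidate viewpoints by the ratio of an (assumed) detection probability to the travel cost from the current pose. The viewpoint names, probabilities and the greedy heuristic are all hypothetical; this is not a planner taken from the surveyed literature.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def plan_search_path(start: Position,
                     viewpoints: Dict[str, Position],
                     detection_prob: Dict[str, float]) -> List[str]:
    """Greedily order candidate viewpoints by (assumed) detection probability per unit
    of travel cost from the current pose."""
    def dist(a: Position, b: Position) -> float:
        return math.hypot(a[0] - b[0], a[1] - b[1])

    order: List[str] = []
    current = start
    remaining = set(viewpoints)
    while remaining:
        best = max(remaining,
                   key=lambda v: detection_prob[v] / (dist(current, viewpoints[v]) + 1e-6))
        order.append(best)
        current = viewpoints[best]
        remaining.remove(best)
    return order


# Hypothetical search region: three viewpoints with guessed detection probabilities.
views = {"door": (0.0, 5.0), "desk": (3.0, 1.0), "shelf": (6.0, 4.0)}
probs = {"door": 0.1, "desk": 0.6, "shelf": 0.3}
print(plan_search_path((0.0, 0.0), views, probs))   # ['desk', 'shelf', 'door']
```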

We should point out that a significant portion of the active vision research has been applied on systems where the vision algorithms are not applied concurrently to the execution of the actions. Dickmanns introduced a slightly different paradigm to vision, where machine vision is typically applied on dynamic scenes viewed from a moving platform, or in other words, where vision algorithms are executed concurrently to the actions performed [50]. He referred to this framework as dynamic vision. Even though early work on active vision [32] was based on the argument that the observer is in motion, in practice, most active object recognition systems assume that the sensor is stationary when the images are acquired. We will discuss this framework in more detail in Sec. 3.

In recent work the role of learning algorithms has become much more important to the object recognition problem. This has resulted in a blurring of the distinction that emerged during the 1980s, between computer vision research and classical AI. This has also resulted in an emerging debate in the community, as to the intended audience of many computer vision journals. Often the main innovation presented in such papers is more closely related to machine learning, while vision is only treated as a small after-effect/application of the presented learning-based algorithm. Another after-effect of this pattern is that recent work has drifted away from Marr’s early paradigm for vision. Nevertheless, and as we will see in Sec. 4, the fact remains that some of the most successful recognition algorithms rely currently on advanced learning techniques, thus significantly differentiating them from early recognition research. In Sec. 4.3 we discuss how certain emerging constraints in computing technology might affect the evolution of learning algorithms over the next few years.

This introductory discussion has outlined the breadth and scope of the approaches adopted by the vision community over the last 50 years, in attempting to solve the recognition problem. The rest of the paper presents a more detailed overview of the literature on object detection, localization and recognition, with a lesser focus on the efforts made to address the significantly more general and challenging problem of object understanding. The relevant literature is broadly categorized into a number of relevant subtopics that will help the reader gain an appreciation of the diverse approaches taken by the community in attempting to solve the problem [51, 52]. This survey illustrates the extent to which previous research has addressed the often overlooked complexity-related challenges which, in our opinion, have inhibited the creation of robust generic recognition systems. This work also fills a void, by presenting a critical and systematic overview of the literature lying at the intersection of active vision and object recognition. Our work also supports the position that active and attentive approaches [31, 29] to the object recognition problem constitute the next natural evolutionary step in object recognition research.

Chart 1: A historical perspective (spanning 1971-2012) on the papers that will be discussed with the most detail during the rest of this survey, comparing them along a number of dimensions. The horizontal axis denotes the mean score of the respective papers from Tables 1-7. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.

In Chart 1 we project the algorithms surveyed in this paper along a number of dimensions, and highlight the evolution of the dimensions’ relative importance over the years. A number of patterns become evident upon inspecting Chart 1. For example there is a clear increase in focus over the years with respect to the scalability of inference, search efficiency, and training efficiency. At the same time, in early work there was a significantly greater focus on the use of 3D in recognition systems. Similarly we see that the search for powerful indexing primitives and compact object representations was always recognized as an important topic in the literature, while there is less consistency in the use of function, context and texture features. These points are elaborated later in this survey.

The remainder of the paper is organized as follows. In Sec. 2 we survey classical approaches to the object recognition and understanding problems, where the data acquisition processes demonstrate limited intelligence. Sec. 3 further motivates the active and dynamic approaches to vision. In Sec. 4 we discuss some of the most characteristic approaches adopted over the years by algorithms that have won various vision challenges. The section ends with a brief discussion as to where the field appears to be headed in the near future. Sec. 5 summarizes the paper.


2. Classical Approaches

We present a critical overview of classical approaches to the object recognition problem. Most of the methods described exhibit limited purposive control over the data acquisition process. Therefore, the word “passive” can also be used to differentiate the approaches described from “active” approaches to the recognition problem. In subsequent sections we will build upon work in this section, in order to overview the less developed field of active approaches to the recognition problem. Though the earliest work appeared in the late nineteen-eighties, the field still remains in its infancy, with a plethora of open research problems that need further investigation. It will become evident, as we review the relevant literature, that a solution to the recognition problem will require answers to a number of important questions that were raised in [26]. That is, questions on the effects that finite computational resources and a finite learning time have in terms of solving the problem. The problem of constructing optimal object representations in particular emerges as an important topic in the literature on passive and active object recognition. Algorithms whose construction is driven by solutions that provide satisfactory answers to such questions, must form a necessary component of any reliable passive or active object recognition system. The categorization of the relevant literature on classical approaches to the recognition problem follows the one proposed by Dickinson¹, with modifications in order to include in the survey some more recent neuromorphic approaches that have gained in popularity. This section’s presentation on passive approaches to the recognition problem is also used to contextualize the discussion in Sec. 3, on active approaches to the problem.

As we survey the literature in the field, we use the standard recognition pipeline described in Fig. 1 as a common framework for contextualizing the discussion. We will support the hypothesis that most research on object recognition is reducible to the problem of attempting to optimize either one of the modules in the pipeline (feature-extraction → feature-grouping → object-hypothesis → object-verification) or is reducible to the problem of attempting to improve the operations applied to the object database in Fig. 1, by proposing more efficient querying algorithms, or more efficient object representations (which in turn support better inference and learning algorithms and reduce the related storage requirements). Sporadically throughout the text, we will recap our discussion, by comparing some of the most influential papers discussed so far. We do this by comparing these papers along various dimensions such as their complexity, the indexing strength, their scalability, the feasibility of achieving general tasks, their use of function and context, the level of prior knowledge and the extent to which they make use of 3D information. We provide discussions on the assumptions and applicability of the most interesting papers and discuss the extent to which these methods address the aspects of detection, localization and recognition/classification.

2.1. Recognition Using Volumetric Parts

Recognition using volumetric parts such as generalized cylinders constitutes one of the early attempts at solving the recognition problem [21], [53]. The approach was popularized by a number of people in the field such as Nevatia and Binford [54], [38], Marr [23], [55] and Brooks [56] amongst others. This section briefly overviews some of the most popular related approaches. It is interesting to notice that the earliest attempts at solving the object recognition problem used high level 3D parts based objects, such as generalized cylinders and other deformable objects, such as geons and superquadrics. However, in practice, it was too difficult to extract such parts from images. A number of important points must be made with regards to parts based recognition. High level primitives such as generalized cylinders, geons and superquadrics (which we describe in more detail later in this paper) provide high level indexing primitives. View-based/appearance-based approaches on the other hand provide less complex indexing primitives (edges, lines, corners, etc.) which result in an increase in the number of extracted features. This makes such low level features less reliable as indexing primitives when using object databases numbering thousands of objects. In other words, the search complexity for matching image features to object database features increases as the extracted object feature complexity decreases. This explains why most of the work that uses such low level primitives is only applied to object databases with a small number of objects. The above described problem is often referred to in the literature as the semantic gap problem. As it is argued in [22], a verification/disambiguation/attention-like mechanism is needed to disambiguate and recognize the objects, because such primitives of low complexity are often more frequent and ambiguous. In other words, with simple indexing features, the burden of recognition is no longer determined by the task of deciding which complex high-level features are in the image, a difficult task by itself, but is instead shifted to the verification stage and discrimination of simple indexing primitives. The parts-based vs. view-based approach to recognition has generated some controversy in the field. This controversy is exemplified by a sequence of papers on what is euphemistically called the Tarr-Biederman debate [57, 58, 59, 60]. The topic of hierarchies of parts-based and view-based representations has assumed a central role in the literature for efficiently representing objects and bridging the above described semantic gap problems. This is exemplified by a number of papers that we survey.

¹ Personal communication, CSC 2523: Object Modeling and Recognition, University of Toronto. Also see [51].
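The indexing trade-off described in the preceding paragraph can be made concrete with a deliberately crude back-of-the-envelope argument of our own (it is not taken from the cited papers).

```latex
% Illustrative only: if each extracted primitive is ambiguous enough to index, on average,
% c candidate interpretations in the object database, and an object hypothesis requires a
% mutually consistent assignment of m primitives, then before any geometric pruning the
% space of joint assignments that verification must cope with grows roughly as
\[
  N_{\text{hypotheses}} \;\approx\; c^{\,m}.
\]
% High-level primitives (generalized cylinders, geons) aim for a small c at the price of
% unreliable extraction; low-level primitives (edges, corners) are easy to extract but drive
% c up, shifting the burden to grouping and verification.
```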

Chart 2: Summary of the Table 1 scores, consisting of the 1971-1996 papers surveyed that make significant contributions to volumetric part modelling, automatic programming, perceptual organization and interpretation tree search. Notice that 3D parts, and their use for indexing and encoding objects compactly, form a significant component of this set of papers.

In Chart 2 and Table 1 we present a comparison, along certain dimensions, for a number of the papers surveyed in Secs. 2.1-2.4, which includes a number of recognition models that use volumetric parts-based representations. For example, since much of the early research on volumetric parts used manually trained models, a single star is used in the corresponding training efficiency columns. It will become evident that most progress in the field lies in the image classification problem (as opposed to the more demanding object localization and segmentation problems), which is aligned with the current needs of industry for image and video search solutions.

Paper (1971-1996) | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Binford [54] | * | * | * | ** | ** | * | ** | *
Nevatia and Binford [38] | * | * | * | ** | ** | * | ** | *
Brooks [56] | * | * | * | *** | *** | * | ** | *
Zerroug and Nevatia [61] | * | * | * | *** | *** | * | ** | *
Bolles and Horaud [62] | * | * | ** | ** | ** | * | *** | *
Ikeuchi and Kanade [41] | * | * | ** | ** | *** | * | *** | *
Goad [63] | * | * | ** | ** | ** | * | ** | *
Lowe [64] | ** | ** | * | * | * | * | ** | *
Huttenlocher and Ullman [65] | ** | ** | * | * | ** | * | ** | *
Sarkar and Boyer [66] | ** | ** | * | *** | *** | ** | ** | *
Grimson and Lozano-Perez [67] | ** | *** | * | ** | ** | * | *** | *
Fan et al. [68] | ** | * | ** | * | *** | * | *** | *
Clemens [69] | ** | *** | ** | ** | ** | * | *** | *

Table 1: Comparing some of the more distinct algorithms of Secs. 2.1-2.4 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*, **, ***, ****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.

Binford [54] and Nevatia and Binford [38] introduced and popularized the idea of generalized-cylinder-based recognition. A generalized cylinder is a parametric representation of 3D cylinder-like objects, defined by a 3D curve called an axis, and planar cross sections normal to the axis. These planar cross sections are defined by a cross section function which in turn depends on the 3D curve’s parameterization. According to Binford’s definition, the cross sections’ center of gravity must intersect the 3D curve. Nevatia and Binford use a subset of generalized cylinders, namely generalized cones, to recognize the objects present in a particular scene. They use the range data of a particular scene to segment the scene into its constituent parts, by clustering together regions with non-rapidly changing depth. Each such segmented cluster is then further segmented into parts that can be most easily described by generalized cones. They accomplish this by extracting the medial axis of each segmented cluster, similar to Blum’s medial axis transformation [70], and splitting the cluster into a new subcluster whenever the medial axis changes rapidly. This results in a number of segments with smoothly changing medial axes which can be described by the 2D projections of 3D generalized cones. The authors do not advocate using any common optimization technique to determine the rotation and scale of the generalized cone that should be used. Instead, they advocate using a rather brute force approach: project each 3D generalized cone to a 2D image plane, rotate it a number of times and determine which rotation results in the best fit. The extracted cylinders are then used to build a graph-like representation of the detected object. For example, if a detected cylinder’s cone angle exceeds a particular angle threshold, that particular part is labelled as conical as opposed to cylindrical. Elongated parts, with a length to width ratio greater than 3.0, are similarly labelled as well-defined. These qualitative descriptors are used as object features/indices which are in turn used to recognize the object from a database of object descriptors.
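The parameterization just described is easy to state in code. The sketch below is a toy generalized cylinder with a circular cross section swept along an axis curve; real generalized cylinders allow arbitrary cross-section functions, and the class name and methods here are our own illustrative choices, not the recovery procedures of [38, 54].

```python
import math
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]


class GeneralizedCylinder:
    """Toy generalized cylinder: a 3D axis curve a(t), t in [0, 1], swept by a circular
    cross section of radius r(t). This sketch only illustrates the parameterization."""

    def __init__(self,
                 axis: Callable[[float], Point3D],
                 radius: Callable[[float], float]):
        self.axis = axis
        self.radius = radius

    def surface_points(self, n_t: int = 20, n_theta: int = 16) -> List[Point3D]:
        points: List[Point3D] = []
        for i in range(n_t):
            t = i / (n_t - 1)
            x, y, z = self.axis(t)
            r = self.radius(t)
            for j in range(n_theta):
                theta = 2.0 * math.pi * j / n_theta
                # Cross section drawn in the x-y plane; a fuller implementation would
                # orient it normal to the local tangent of the axis curve.
                points.append((x + r * math.cos(theta), y + r * math.sin(theta), z))
        return points


# A generalized cone: straight vertical axis with a linearly tapering radius.
cone = GeneralizedCylinder(axis=lambda t: (0.0, 0.0, t), radius=lambda t: 1.0 - 0.5 * t)
print(len(cone.surface_points()))  # 320 sampled surface points
```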

Brooks [56] presents the Acronym object recognition system. The author again uses generalized cones to model the objects, as shown in Figs. 4 and 5. Volumetric models and spatial relations amongst object parts are represented using an object graph. The author also defines a restriction graph which is used to define a class and subclass hierarchy for the object we are modelling. In a way, this provides “scale” information which specifies the amount of detail we wish to use when defining and searching for an object (see Fig. 5). For each part’s joint the author defines constraints on the relations between the various parts. Most often these constraints are used to define the permissible angles between the various joints that would lead to acceptable instances of the object. For example, if we are interested in modelling an articulated object such as a piston, constraints can be defined denoting the allowable articulation of the object’s parts that would lead to an instance of a piston. The author defines a constraint manipulation system and shows how the geometric reasoning that this model provides can be used to reason about the model and discover geometric and quasi-geometric invariants about a particular object model. These discovered invariants are positioned in a prediction graph, which is used in conjunction with extracted image features to determine whether the desired object exists in the image. The distinguishing characteristic of Brooks’ work is that it is one of the first systems having used parts-based recognition and generalized cylinders to provide reliable results.
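A small sketch of the kind of joint-angle constraints described above is given below; it only illustrates checking a hypothesized articulation against permissible angle ranges, and is far simpler than Acronym's actual constraint manipulation system, whose names and mechanics it does not reproduce.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class JointConstraint:
    part_a: str
    part_b: str
    angle_range: Tuple[float, float]   # permissible angle between the two parts, in degrees


def satisfies_constraints(configuration: Dict[Tuple[str, str], float],
                          constraints: List[JointConstraint]) -> bool:
    """Check whether a hypothesized articulation (angles between part pairs) is an
    acceptable instance of the modelled object class."""
    for c in constraints:
        angle = configuration.get((c.part_a, c.part_b))
        if angle is None:
            return False
        low, high = c.angle_range
        if not (low <= angle <= high):
            return False
    return True


# A crude "piston" class: the rod may swing only a few degrees relative to the cylinder.
piston = [JointConstraint("cylinder", "rod", (0.0, 15.0)),
          JointConstraint("rod", "crank", (30.0, 150.0))]
print(satisfies_constraints({("cylinder", "rod"): 5.0, ("rod", "crank"): 90.0}, piston))   # True
print(satisfies_constraints({("cylinder", "rod"): 40.0, ("rod", "crank"): 90.0}, piston))  # False
```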

Zerroug and Nevatia [61] study the use of the orthographic projection invariants that are generated from instances of generalized cylinders. These projective invariants are detected from intensity images. A verification phase is used to verify the goodness-of-match of 3D shapes based on the extracted image features, thus providing an alternative approach for recovering 3D volumetric primitives from images.

Figure 4: Some of the volumetric parts used by the Acronym system of Brooks [56].

Figure 5: Examples of the objects generated by Brooks’ system [56]. (first row): Three specializations of the class of electric motors. (second row): The modelling of articulated objects, such as these subcomponents of a piston (adapted from [56]).

In practice, the main limitation of generalized cylinders lies in the need to adapt the input scene to the model based on volumetric parts. The inability to come up with a good optimization scheme for extracting parts from images is a significant drawback of such algorithms. The models we reviewed here are mainly manually trained, and as we will see when we review deformable models, it is not clear how to automatically extract such 3D parts-based components from images. Within the context of Fig. 1, the power of generalized cylinders and cones lies in their potential future evolution as a representationally compact and powerful indexing primitive, which was what motivated early research on the topic. While it is not clear how to achieve repeatability when extracting these primitives from 2D images, when reliable 3D data is available the mapping from images to primitives becomes much more predictable. While ultimately there may not exist a one-to-one mapping between an image and a set of generalized cylinders, there may very well exist one-to-many and many-to-one mappings between images and generalized cylinders which can lead to a powerful mechanism for forming object hypotheses.

2.2. Automatic Programming

A critical issue in object recognition is the problem of extracting and organizing the relevant knowledge about an object and turning this knowledge into a vision program. This is referred to as automatic programming. This subsection briefly overviews some of the early work on the automatic generation of recognition programs [62], [41], [63]. This entails learning the important features of an object automatically, learning the most important/discriminating aspects/views of the object and coming up with search strategies for identifying the object from a jumble of objects. In Sec. 2.7 we will discuss the role of affordances in improving recognition performance, since it is believed that during early childhood development the association between an object’s visual appearance and its usage is primed. This in turn will highlight the existence of a relatively unexplored link between active approaches to visual inspection (see Sec. 3) and automatic programming algorithms, which could in principle improve the full standard recognition pipeline (see Fig. 1).

Before the publication of the early work on automatic programming, many components of successful recognition programs were handwritten. For example, in the Acronym system presented in the previous subsection, the user has to manually define an object graph and a restriction graph for each one of the objects he wishes the system to be capable of recognizing. When we are constructing systems that need to recognize thousands of objects, this is obviously a slow, expensive and suboptimal process.


[Figure 6 diagram: an interpretation tree whose root splits the aspects S1-S5 by face moment into nodes N1: {S1}, N2: {S2, S3, S4}, N3: {S5}; topology and local features split N2 further into N21: {S2}, N22: {S3, S4}, and then N221: {S3}, N222: {S4}; EGI, moment, region, shape and edge features are then used for aspect classification, viewer direction and viewer rotation.]

Figure 6: An object model compiled into an interpretation tree. Adapted from Ikeuchi and Kanade [41]. This interpretation tree consists of 5 aspects S1,...,S5. At each node of the tree a feature (such as face moment, topology, etc.) is used to discriminate each node’s aspect group and ultimately classify each aspect. Then, more features are used to determine the viewpoint direction and rotation of the aspects. Given a region in an input image, the corresponding features are estimated from that region, and if the estimated features lead to a path in this interpretation tree from the root to a leaf, then the object and its attitude/viewpoint have been determined successfully.

Goad [63] published one of the first methods for the automatic generation of object recognition programs. He compiles the visible edges of an object in the current field of view into an interpretation tree and uses this tree to interpret the image. However, this work relies on a single view/aspect of the object. Similarly, Bolles and Horaud [62] use three-dimensional models of various objects to find them in range data. This is a system for the recognition of objects in a jumble and under partial occlusions. Given candidate recognized objects, a verification stage follows, and then the algorithm determines some essential object configurational information, such as which objects are on top of each other. A disadvantage of such early work is that it relies heavily on edge/line based models, which are not always suitable for certain objects that are differentiated by more subtle features such as color and texture. Within the context of the standard recognition pipeline of Fig. 1, this work represents an early effort at the generation and verification of object hypotheses. While the problem of localizing objects of interest from a jumble remains relevant and constitutes one of the earliest problems that vision systems were tasked with solving, the general version of the problem is still open.

Ikeuchi and Kanade [41] modified slightly the Koenderink and van Doorn [71] definition of an object’s aspectto create a multiview recognition system based on the aspect graphs of simple 3D polyhedral objects. Accordingto Ikeuchi and Kanade [41], an aspect consists of the set of contiguous viewer directions from which the same ob-ject surfaces are visible. The authors use a tessellated sphere to sample the object from various viewer directions.Subsequently, they classify these samples into equivalence classes of aspects. Various features of the object are thenextracted to achieve recognition. Features used include face area, face inertia, number of surrounding faces for eachface, distances between the surrounding faces and the face, and the extended gaussian image (EGI) vectors. The poly-hedral object extraction as well as many of these features depend on accurate object depth/structure extraction. This isachieved using a light stripe range finder. These features are used in the interpretation tree for the aspect classification


Figure 7: The Gestalt laws.

The interpretation tree provides a methodology for determining the aspect currently viewed, the viewer direction and rotation with respect to the aspect (this is achieved again by a decision tree type classification on the features) and finally, once all this information is extracted, it is possible to make a hypothesis as to the object currently viewed. This is a view based recognition system and shares the main disadvantage of view based approaches, since a 3D polyhedron with n faces has O(n^3) aspects, making such a system computationally expensive. This algorithm uses a light stripe range finder and therefore it belongs to the group of algorithms that relies on the existence of 3D information. Within the context of the standard recognition pipeline (Fig.1), this work represents an example of feature grouping for the discovery of viewpoint invariants. Similar ideas re-emerge in modern recognition work, often under the guise of dimensionality reduction techniques, affine invariant interest-point detectors and features, as well as hierarchical object representations, where high-level shared features are used to compactly represent an object and recognize novel object instances from multiple views. In Sec.3 we will discuss some extensions of aspect graph matching to the problem of next-view planning and active multiview recognition systems.

As we will see in the next subsections, in more recent work, and as we begin using more low-level features (edges, lines, SIFT-like features [72], etc.) which are currently popular, the need for manual interaction decreases and statistical learning based algorithms accomplish the learning with much less user interaction. However, very little work exists in the active vision literature for automatically extracting optimal object representations in terms of the minimum encoding length and robustness. While there exist hierarchical approaches, which are meant to provide compact representations, there do not exist guarantees that these are the minimal or the most robust representations. As it is argued in [26], such optimal representations constitute an important component of any real-time vision system. As it is argued in [73], the problem of creating object representations that are independent of sensor specific biases has not received attention commensurate with its importance in the vision problem. The advantages in using object representations with a minimal representation length are well known from the machine learning literature (e.g., Occam's razor and smaller storage requirements [74]). These advantages are especially important in hierarchical object representations, since the goal of such representations is to minimize the encoding length through a parts based representation of objects. However, from at least an information theoretic perspective, there are also advantages in not using a minimum encoding length, as this can add a level of redundancy, and redundancy makes recognition systems less fragile.


For example, the paper by Nister and Stewenius [75], which we discuss in more detail in a subsequent section, has an inherent redundancy in its decision system, which might partially explain its good performance. It is not clear what is the best representation for maximizing robustness while also maximizing generalization capability, and minimizing representation length.

2.3. Perceptual Organization

Perceptual organization techniques typically attempt to model the human visual system's canny ability to detect non-accidental properties of low-level features, and group these features, in order to build more compact object representations. Therefore, perceptual organization techniques represent an attempt at improving the feature-grouping and indexing-primitive generation of the standard recognition pipeline (Fig.1). When we extract low level features such as edges and lines, we are usually interested in finding some sort of correspondence/alignment between those features and mapping these groupings to a model of higher complexity. In a typical image, the number of features n can be in the hundreds if not thousands, implying that a brute force approach is impractical for matching 3D object models to 2D image features. The Gestaltists' view is that humans perceive the simplest possible interpretation of the visible data. The greatest success in the study of perceptual organization has been achieved by assuming that the aim of perceptual organization is to detect stable image groupings that reflect non-accidental image properties [64]. A number of common factors which predispose the element grouping were identified by the Gestaltists [76, 77, 78, 79, 80] (also see Fig. 7):

• Similarity: similar features are grouped together.

• Proximity: nearby features are usually grouped together.

• Continuity: Features that lead to continuous or “almost” continuous curves are grouped together.

• Closure: Curves/features that create closed curves are grouped together.

• Common fate: Features with coherent motion are grouped together.

• Parallelism: Parallel curves/features are grouped together.

• Familiar configuration: Features whose grouping leads to familiar objects are usually grouped together.

• Symmetry: Curves that create symmetric groups are grouped together.

• Common region: Features that lie inside the same closed region are grouped together.

In the case of familiar configuration in Fig. 7, the features corresponding to the Dalmatian dog pop-out easily if this is a familiar image. Otherwise, this can be a challenging image to understand. There is evidence that the brain uses various measures — such as the total closure, by measuring the total gap of perceived contours [81] — as intermediate steps in shape formation and representation. Berengolts and Lindenbaum [82, 83] demonstrate that the distribution of saliency — defined as increasing as a point gets near an edge-point — is probabilistically modelled fairly accurately along a curve using the first 3 moments of the distribution and Edgeworth series. Tests are performed for the distribution of the saliency for points near a curve's end and far away from the curve's end. The predicted saliency distribution matches closely the distribution in real images. Such probabilistic methods are useful for making inferences regarding the organization of edges/lines in images.

Lowe [64, 84] formalizes some of these heuristics in a probabilistic framework. He uses these heuristics to "join" lines and edges that likely belong together and thus decreases the overall complexity of the model fitting process. In particular, he searches for lines which satisfy parallelism and collinearity, and searches for line endpoints which satisfy certain proximity constraints. For example, given prior knowledge of the average number d of line segments per unit area, the expected number N of segments within a radius r of a given line's endpoint is N = dπr². If this value is very low for a particular region but a second line endpoint within this radius r has been detected, this is a strong indication that the two lines are not together accidentally, and the two lines are joined. Similar heuristics are defined for creating other Gestalt-like perceptual groups based on parallelism and collinearity.


He uses these perceptual grouping heuristics in conjunction with an iterative optimization process to fit 3D object models onto images, and recognize the object(s) in the image.
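
To make the proximity heuristic concrete, the following sketch (a hypothetical illustration, not Lowe's actual implementation) scores how unlikely an observed endpoint proximity is under a uniform background of segments: with d segments per unit area, the expected number of endpoints within radius r is N = dπr², and a Poisson assumption turns this into a probability of accidental co-occurrence.

```python
import math

def expected_neighbours(d, r):
    """Expected number of segment endpoints within radius r,
    given an average density of d segments per unit area (N = d*pi*r^2)."""
    return d * math.pi * r ** 2

def accidentalness(d, r):
    """Probability that at least one endpoint falls within radius r purely
    by accident, under a Poisson model of endpoint positions."""
    return 1.0 - math.exp(-expected_neighbours(d, r))

# Example: a sparse image (0.01 segments per unit area), endpoints 2 units apart.
p = accidentalness(d=0.01, r=2.0)
print(f"P(accidental proximity) = {p:.3f}")  # a low value -> join the two lines
```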

Huttenlocher and Ullman [65] show that under orthographic projection with a scale factor, three corresponding points between image features and object model features are sufficient to align a rigid solid object with an image, up to a reflection ambiguity. By taking the Canny edges of an image [85], and limiting their feature search to edge corners and inflection points, they derive an alignment algorithm that aligns those features with a 3D model's corresponding features. Those features are chosen because they are relatively invariant to rotations and scale changes. An alignment runs in O(m²n²) time, where m is the number of model interest points and n is the number of image interest points. Once a potential alignment is found, a verification stage takes place where the object model is projected on the image and all its interest points are compared with the image's interest points for proximity.
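
The alignment-and-verify idea can be illustrated with a simplified 2D affine analogue (the full method aligns a 3D model under scaled orthographic projection): three point correspondences fix the transformation, which is then verified by projecting all model points and checking their proximity to image points. The names below are illustrative, not taken from [65].

```python
import numpy as np

def affine_from_triple(model_pts, image_pts):
    """Solve for a 2D affine map (A, t) with A @ m + t = i for the 3 pairs."""
    M = np.hstack([model_pts, np.ones((3, 1))])   # 3x3, rows of [x, y, 1]
    X = np.linalg.solve(M, image_pts)             # 3x2 solution
    return X[:2].T, X[2]                          # A (2x2), t (2,)

def verify(A, t, model_pts, image_pts, tol=2.0):
    """Fraction of projected model points landing near some image point."""
    proj = model_pts @ A.T + t
    d = np.linalg.norm(proj[:, None, :] - image_pts[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) < tol))

# Toy example: a synthetic affine view of a 5-point model.
model = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
true_A = np.array([[2.0, 0.3], [-0.2, 1.5]])
true_t = np.array([10.0, 5.0])
image = model @ true_A.T + true_t

A, t = affine_from_triple(model[:3], image[:3])   # hypothesis from 3 correspondences
print(verify(A, t, model, image))                 # 1.0 -> hypothesis accepted
```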

Sarkar and Boyer [66] present one of the earliest attempts at integrating top-down knowledge in the perceptual organization of images. The authors use the formalism of Bayesian networks to construct a system that is capable of better organizing the external stimuli based on certain Gestalt principles. The system executes repeatedly two phases. A bottom-up/pre-attentive phase uses the extracted features from the image to construct a hierarchy of a progressively more complex organization of the stimuli. Graph algorithms are then used to mine the image data and find continuous segments, closed figures, strands, junctions, parallels and intersections. Perceptual inference using Bayesian networks is used to integrate information about various spatial features and form composite hypotheses, based on the evidence gathered so far. The goal is to repeat this process so that ultimately a more high-level organization and reasoning of the image features is possible.

Yu and Shi [86, 87] define the concepts of "repulsion" and "attraction" for the perceptual organization of images and figure-ground segmentation, and show how to use normalized cuts to segment the images into perceptually similar regions. They argue that such forces might contribute to phenomena such as pop-out and texture segmentation, and they discuss their importance to the problem of visual search.

While investigating the role of perceptual organization in vision is a vibrant topic of research, most commercially successful recognition systems currently rely on far simpler 'flat' architectures, typically consisting of a simple feature extraction layer followed by a powerful classifier (see Sec.4 on the PASCAL challenges). The need for object representations with a minimal encoding length was briefly discussed in [26]. For example, Verghese and Pelli [88] provide some evidence in support of the view that the human visual system is a limited capacity information processing device, by experimentally demonstrating that visual attention in humans processes about 30-60 bits of information. More complex feature groupings and indexing primitives inspired by the modelling of non-accidental image properties could offer another approach for improving the standard recognition pipeline of Fig.1.

2.4. Interpretation Tree Search

A number of authors have worked on interpretation tree search based algorithms [67], [68], [69].

Grimson and Lozano-Perez [67] discuss how local measurements of three-dimensional positions and surface normals, recorded by a set of tactile sensors or by three-dimensional range sensors, are used to identify and localize objects. This work represents an example of how interpretation trees can moderate the explosive growth in the size of the hypothesis space as the number of sensed points and the number of surfaces associated with the object model is increased. The sensor is assumed to provide 3D position and local orientation information for a small number of points on the object, and as such it serves as another example of a system that makes use of the range information provided by an active range sensor. The authors model the objects as 3D polyhedra with up to six degrees of freedom relative to the sensors (3 translational and 3 rotational degrees of freedom), and use local constraints on the distances between faces, angles between face normals, and angles of vectors between sensed points. Given s sensed points and n surfaces in each of the known objects, the total number of possible interpretations is n^s. An interpretation is deemed legal if it is possible to determine a rotation and translation that would align the two sets of points. Since it is computationally infeasible to carry out n^s tests on all possible combinations, an interpretation tree approach — combined with tree pruning — in conjunction with a generate-and-test approach, is used to determine the proper alignment. The constraints used for tree pruning (see Fig.8) include (i) the distance constraint, where the distances between pairs of points must correspond to the distances on the model, (ii) the angle constraint, where the range of possible angles between normals must contain the angle of known object model normals, and (iii) the direction constraint, where for every triple i, j, k of model surfaces, the cones of the directions between the points on the pairs i, j and j, k are extracted and are used to determine whether three sensed 3D points might also lie on surfaces i, j, k.



Figure 8: The constraints of Grimson and Lozano-Perez [67]. The figure shows three points A, B, C on the surface of a cube, and the three normals NA, NB, NC of the corresponding surface planes. For each pair of points A,B, B,C, A,C the distance between the two points refers to the corresponding distance constraint. For each pair of normals NA,NB, NB,NC, NA,NC the angle between the normals refers to an angle constraint. For each pair of surfaces, the cone spanned by the directions between all pairs of points on the two surfaces defines another direction constraint — given 3 sensed points, the corresponding cones can be extracted and if they form a subset of the corresponding model cones, a match has occurred. These cones can be used to prune the interpretation tree.

Each node of the interpretation tree represents one of these model constraints, and at each level of the tree the corresponding model constraint is compared with one of the possible range-data-derived constraints of the scene. If one of the three constraints described above does not hold, the entire interpretation tree branch is pruned. As the authors show experimentally and by a probabilistic analysis, the computational benefits are significant, since the use of such constraints leads to the efficient pruning of hypotheses, which in turn speeds up inference. For example, in one of the experiments that the authors perform, they demonstrate that this pruning leads to a reduction in the number of candidate hypotheses: from the 312,500,000 initial possible hypotheses for the object, only 20 were left. Within the context of Fig.1, the work by Grimson and Lozano-Perez [67] represents a successful attempt at speeding up the hypothesis generation module, by introducing a number of constraints for solving an initially intractable problem. This constitutes an exemplar-based recognition system, and as such it is a good tool for machine vision tasks where we are dealing with the problem of localizing well-defined geometric objects (e.g., assembly line inspection).
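
A minimal sketch of this generate-and-prune idea follows, assuming a toy model in which each pair of faces admits a known range of inter-point distances; the recursion abandons a branch as soon as a pairwise constraint fails, rather than enumerating all n^s interpretations. Function and variable names are illustrative, not taken from [67].

```python
import numpy as np

def interpretation_tree(sensed, model_faces, pairwise_ok):
    """Depth-first assignment of sensed points to model faces. A branch is
    pruned as soon as a constraint between the newly assigned point and any
    previously assigned point fails."""
    s = len(sensed)
    interpretations = []

    def extend(assignment):
        k = len(assignment)
        if k == s:
            interpretations.append(tuple(assignment))
            return
        for f in model_faces:
            if all(pairwise_ok(j, assignment[j], k, f) for j in range(k)):
                extend(assignment + [f])

    extend([])
    return interpretations

# Toy model: allowed ranges of distances between points lying on face pairs.
face_dist = {(0, 1): (1.0, 2.0), (0, 2): (2.0, 3.0), (1, 2): (0.5, 1.5)}
def dist_range(fi, fj):
    return (0.0, 3.0) if fi == fj else face_dist[tuple(sorted((fi, fj)))]

sensed = np.array([[0.0, 0.0], [1.5, 0.0], [0.0, 2.5]])
def pairwise_ok(i, fi, j, fj):
    lo, hi = dist_range(fi, fj)
    return lo <= np.linalg.norm(sensed[i] - sensed[j]) <= hi

print(interpretation_tree(sensed, [0, 1, 2], pairwise_ok))
```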

Fan et al. [68] use dense range data and a graph-based representation of objects — where the graph captures information about the various surface patches and their relation to each other — to recognize objects. These relations might indicate connectivity or occlusion. A given scene's graph is decomposed into subgraphs (e.g., feature grouping) and each subgraph ideally represents a detected object's graph representation. The matching is performed using three modules: a screener module, which determines the most likely candidate views for each object, a graph matcher module, which compares candidate matching graphs, and an analyzer, which makes proposals on how to split/merge object graphs. Features used during the matching between an object model and a scene extracted graph include the visible area of each patch, the orientation of each patch, its principal curvatures, the estimated occlusion ratio, etc. Each patch is encoded by a node in the graph, and adjacent patches are encoded with an edge between the nodes. Heuristic procedures are defined on how to merge/split such graphs into subgraphs based on the edge strength. Heuristic procedures for matching graphs are also provided. The authors attempt to address many issues simultaneously, such as object occlusion and segmenting out background nuisances. However, it is not clear how easy it is to reliably extract low-level features such as patch orientations and principal curvatures from images. Furthermore, the complexity requirements for reliably learning such object representations might be quite high. Within the context of the general recognition framework in Fig.1, the system makes a proposal for improving all of the pipeline's components, from the extraction and grouping of strong indexing primitives, all the way to the hypothesis generation and object verification stage. It is not clear, however, how efficient this object representation is in terms of its encoding length, and it is, thus, not clear how well it compares to other similar approaches. Hierarchical representations of objects will be discussed in more detail later in this survey, and are meant to provide reusable and compact object parts. They constitute a popular and closely related extension of graph based representations of objects.

Many interpretation-tree approaches do not automatically adjust the expressive power of their representation during training and online object recognition.


Figure 9: The geometric hashing approach by Lamdan et al. [89].

As it is discussed in [26], this could have serious implications for the training process and the reliability of online recognition. A fundamental problem of interpretation tree search is coming up with a good tradeoff between tree complexity and generalization ability, and making the system capable of controlling the representational complexity [26]. However, this problem of a constant expressive power is shared by most recognition systems described in this document, so this is not an issue exclusively related to the interpretation-tree approach to recognition.

2.5. Geometric Invariants

Geometric invariants are often used to provide an efficient indexing mechanism for object recognition systems (see Fig.1). The indexing of these invariants in hash-tables (geometric hashing [90], [91], [89], [92], [93], [94], [95]) is a popular technique for achieving this efficiency. A desirable property of such geometric invariants is that they are invariant under certain group actions, thus providing an equivalence class of object deformations modulo certain transformations. Typical deformations discussed in the literature include 2D translations, rotations and scalings (similarity transformations), as well as 2D affine and projective transformations. Many of these hashing techniques have also been extended to the 3D case and are particularly useful when there exists reliable range information. Such rapid indexing mechanisms have also been quite successful in medical imaging and bioinformatics, and particularly in matching template molecules to target 3D proteins [96]. Thus, under certain restrictions, geometric hashing techniques can reliably address a number of problems [95]:

1. Obtaining an object indexing mechanism that is invariant under certain often-encountered deformations (e.g., affine and projective deformations).

2. Obtaining an object indexing mechanism that is fairly robust under partial occlusions.

3. Recognizing objects in a scene with a sub-linear complexity in terms of the number of objects in the object database.

The inherent parallelism of geometric hashing approaches is another one of their advantages.


Chart 3: Summary of the scores from the Table 2 papers published between 1973-1999 that make significant contributions to geometric invariants, 3-D shapes/deformable models, function, context, and appearance based recognition. This set of papers emphasizes the use of powerful indexing and object encoding primitives. We notice that apart from the 3-D shape/deformable model papers, the other papers do not make much use of 3-D information for inference and object modelling/representation, and very few of the other papers make use of function or context. In other words, there was little crosstalk between the paradigms during 1973-1999.

A problem with such group invariants is that perspective projections do not form a group. Furthermore, early work on the application of group invariants was complicated by the fact that other common object deformations also do not form groups [97] and cannot be described easily by closed form expressions.

Notice that in the recent literature, geometric invariants tend to emerge within the context of local feature-based and parts-based methods (which are discussed in more detail later in Sec. 2). For example, interest-point detectors that are invariant with respect to various geometric transformations (translation or affine invariance for example) are often used to determine regions of interest, regardless of image or sensor specific transformations/deformations. This provides a measure of robustness for determining regions around which features or parts can be extracted reliably and with a degree of invariance (Agarwal and Roth [98], Weber et al. [99], Fergus et al. [100], Lazebnik et al. [101], Mikolajczyk and Schmid [102]).

The origins of the idea of geometric hashing for shapes are traced to Schwartz and Sharir [90], [91], [95]. Often, subsets of feature points are used to obtain a coordinate frame of the image's object, and all other model/image features use this coordinate frame to get expressed in affine invariant or projective coordinates. Other popular invariants include the differential invariants (under Euclidean actions) of Gaussian curvature and torsion, as well as a number of invariants related to plane conics which can be applied to lines, arcs, and corners [92]. We will discuss some of these invariants later in this section. These invariants are used to rapidly access a hash table. Typically, this procedure is repeated a number of times for each object, votes are accumulated for each such subset of coordinates, and the object identity hypothesis with the most votes is the choice of the recognized object. In Charts 3, 4 and Table 2 we present a comparison, along certain dimensions, for a number of the papers surveyed in this section and Secs.2.6-2.8.

Lamdan et al. [89] use regular 2D images. They extract interest points at locations of deep concavities and sharp convexities. Assume e00, e10, e01 are an affine basis triplet in the plane. The affine coordinates (α, β) of a point v in the plane are given by

v = α(e10 − e00) + β(e01 − e00) + e00. (1)

Any affine transformation T would result in the same affine coordinates since

Tv = α(Te10 − Te00) + β(Te01 − Te00) + Te00. (2)

Given an image model of an object with m interest points, for each triple of points, the affine coordinates of the other m − 3 points are extracted. Each such (α, β) coordinate is used as an index to insert into a hash table the affine coordinate basis and an object ID. This makes it possible to encode each interest point using all possible affine basis coordinates.


Chart 4: Summary of the scores from the Table 2 papers published between 2000-2011 that make significant contributions to geometric invariants, 3-D shapes/deformable models, function, context, and appearance based recognition. This set of papers emphasizes the use of powerful indexing and object encoding primitives. Compared to Chart 3, we notice an even smaller role of 3D in recognition, and a greater emphasis on function, context and efficient inference algorithms.

For each triplet of interest points in the image, their corresponding affine coordinate basis is used to calculate the coordinates (α, β) of all the other interest points. These coordinates are hashed in the hash table. The object entry in the hash table with sufficient votes is chosen as the recognized object. If a verification stage also succeeds — where the object edges are compared with those of the scene — the algorithm has succeeded in recognizing the object. Otherwise, a new affine basis coordinate is chosen and the process is repeated (see Fig. 9).
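
The sketch below illustrates the basic affine-invariant hashing scheme under simplifying assumptions (a single model, quantized hash keys, one scene basis tried); it is an illustration of the idea, not a reproduction of the implementation in [89].

```python
import numpy as np
from collections import defaultdict
from itertools import permutations

def affine_coords(basis, pts):
    """Express points in the affine frame (e00, e10, e01) of eq. (1)."""
    e00, e10, e01 = basis
    B = np.column_stack([e10 - e00, e01 - e00])      # 2x2 basis matrix
    return np.linalg.solve(B, (pts - e00).T).T       # one (alpha, beta) per point

def build_table(models, q=0.25):
    """Hash every model point under every ordered, non-degenerate basis triple."""
    table = defaultdict(list)
    for name, pts in models.items():
        for idx in permutations(range(len(pts)), 3):
            basis = pts[list(idx)]
            B = np.column_stack([basis[1] - basis[0], basis[2] - basis[0]])
            if abs(np.linalg.det(B)) < 1e-6:
                continue                             # skip collinear bases
            for ab in affine_coords(basis, pts):
                key = tuple(np.round(ab / q).astype(int))
                table[key].append((name, idx))
    return table

def recognize(table, scene_pts, q=0.25):
    """Pick one scene basis triple, hash the scene points, and vote."""
    votes = defaultdict(int)
    for ab in affine_coords(scene_pts[:3], scene_pts):
        key = tuple(np.round(ab / q).astype(int))
        for entry in table[key]:
            votes[entry] += 1
    return max(votes, key=votes.get) if votes else None

# Toy usage: recognize an affinely transformed view of a 5-point model.
model = np.array([[0., 0.], [2., 0.], [0., 1.], [2., 1.], [1., 2.]])
T, shift = np.array([[1.2, 0.4], [-0.3, 0.9]]), np.array([5.0, 7.0])
scene = model @ T.T + shift
table = build_table({"widget": model})
print(recognize(table, scene))   # the winning votes point back to the "widget" model
```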

Flynn and Jain [93] describe an approach for 3D to 3D object matching using invariant feature indexing. Solid models of objects — composed of cylinders, spheres, planes — are used to determine corresponding triples (M1, S1), (M2, S2), (M3, S3), where Mi denotes a model surface and Si denotes a corresponding scene surface. For each pair of extracted scene cylinders, spheres and planes, an invariant feature is defined and extracted. For example, for each pair of scene cylinders and planes the angle between the plane's normal and the cylinder's axis of symmetry is extracted. Pairs or triples of such invariant features are used to access tables where each table entry contains a linked-list of all the database object models composed of the same invariant features. A vote is placed for each of the objects in that table entry. By performing this process across all the extracted invariant features of the scene object, the table object with the most votes is selected as the recognized object.

Forsyth et al. [92] present a framework on invariant descriptors for 3D model-based vision. The authors survey the large mathematics literature on projective geometry and its invariants, and apply these invariants to the recognition problem. One of the projective invariants that they discuss, for example, involves the use of plane conics. A plane conic is given by the values of x satisfying x^T c x = 0, where c is some symmetric matrix with a determinant of 1. A projective invariant of such a conic is given by the value

$\frac{(x_1^T c\, x_2)^2}{(x_1^T c\, x_1)\,(x_2^T c\, x_2)}$    (3)

where x1, x2 are any two points not lying on the conic. In other words, the value in (3) is independent of the coordinate frame in which x1, x2 and the conic are measured, and is invariant under projective distortions. For example, this can be useful for consistently detecting a car's wheels from multiple viewpoints.
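
The invariance of (3) is easy to verify numerically: transforming homogeneous points as x → Hx and the conic as c → H^{-T} c H^{-1} leaves the ratio unchanged. The snippet below is a small illustrative check with an arbitrarily chosen conic, points and homography.

```python
import numpy as np

def conic_invariant(c, x1, x2):
    """The invariant of eq. (3) for a plane conic c and homogeneous points x1, x2."""
    return (x1 @ c @ x2) ** 2 / ((x1 @ c @ x1) * (x2 @ c @ x2))

rng = np.random.default_rng(0)
c = np.array([[2.0, 0.5, 0.1],
              [0.5, 1.0, 0.0],
              [0.1, 0.0, -1.0]])                  # a symmetric conic matrix
x1 = np.array([1.0, 2.0, 1.0])                    # homogeneous points
x2 = np.array([-1.0, 0.5, 1.0])                   # (neither lies on the conic)

H = rng.normal(size=(3, 3))                       # a random projective map
Hinv = np.linalg.inv(H)
c_t = Hinv.T @ c @ Hinv                           # conic transforms as H^-T c H^-1
x1_t, x2_t = H @ x1, H @ x2                       # points transform as H x

print(conic_invariant(c, x1, x2))
print(conic_invariant(c_t, x1_t, x2_t))           # same value: projectively invariant
```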

A problem with many geometric hashing techniques is that the resulting feature distributions are not uniform, thus slowing down the indexing mechanism by unevenly distributing the indices in the hash-table cells. Such problems are usually addressed by uniformly rehashing the table through the use of a distribution function that models well the expected uneven distribution of features in the original hash-table.

Bayesian formulations of the problem are also useful in modelling positional errors in the hash-tables [94, 95]. For example, one can attempt to maximize the probability P(Mk, i, j, B | S′) by using Bayes' theorem to assign weighted votes to the hash-table.


[Figure 10 content: processing stages — edge extraction, detection of non-accidental properties, parsing of regions of concavity, determination of components, matching of components to object representations, and object identification. Principle of non-accidentalness: critical information is unlikely to be a consequence of an accident of viewpoint. Non-accidental 2-D relations and their 3-D inferences: collinearity of points or lines implies collinearity in 3-space; curvilinearity of points of arcs implies curvilinearity in 3-space; symmetry implies symmetry in 3-space; parallel curves (over small visual angles) imply parallel curves in 3-space; two or more curve terminations at a common point (L, fork and arrow vertices) imply curves terminating at a common point in 3-space.]

Figure 10: Hypothesized processing stages in object recognition according to Biederman [40]. The process relies on the extraction of 5 types of non-accidental properties from images, which in turn help in inferring the corresponding 3-D geons.

Here Mk is an object model, i, j are indices of two distinct points on the model which correspond to two points from the basis set B (these two points can define an axis of the current coordinate frame), and S′ is the set of extracted scene points excluding the currently chosen basis points in B. The use of such a redundant vote representation scheme can also diminish the need to consider all possible model basis combinations in various hashing and voting algorithms.

Geometric hashing algorithms constitute a proven methodology offering a rapid indexing mechanism. However, there is little work on bridging the semantic gap between the low-level features typically extracted from images and the higher-order representations that are ultimately necessary for recognition algorithms to work well with non-trivial objects, while simultaneously maintaining the rapid indexing advantages of such hashing approaches.

2.6. Qualitative 3-D Shape-Based Recognition and Deformable Models

Do humans recognize objects by first recognizing sub-parts of an object, or are objects recognized as an image/whole in one shot?

Is it perhaps the case that we first learn to recognize an object by parts, but as we become more familiar with the object, we recognize it as a whole? The answer to these questions could have profound implications for the design of computer vision systems. A number of researchers have addressed this issue. This section overviews some of the related research.

The combinatorial and minimum encoding length arguments from the previous sections provide a compelling argument as to the need for parts based recognition. It is combinatorially infeasible to achieve 3D recognition of wholes without parts based recognition preceding it first. Pelli et al. [103] provide some compelling arguments in support of the parts based approach. To support their arguments, the authors demonstrate that human efficiency in reading English words is inversely proportional to word length, where "efficiency" is defined as the ratio of the ideal observer's threshold energy divided by a human observer's threshold energy — threshold energy being the minimum energy needed in the signal/word to make it observable. The authors demonstrate that despite having read billions of words in their lifetime and the visual system having learnt them as well as it is possible, humans do not learn to recognize words as wholes. They demonstrate that efficiency decreases with increasing word length. If humans recognized words as wholes, this effect should not be as pronounced. A word is never learnt as an independent feature and human performance never surpasses that achievable by letter based models. A word cannot be read unless its letters can be separately recognized and its components are detected.


This leads to some interesting ideas with respect to purely feedforward approaches to recognition [104], [105], [106]. As localization is very difficult using purely feedforward approaches, and since some sort of localization on the object is necessary to recognize the individual parts, an attention mechanism [29] is necessary in order to provide this localization/parts-based information.

It is, however, unknown what the components used in parts based recognition are. To this end, numerous hypotheses have been formulated which attempt to explain the components used in parts based recognition. However, as many years of research on the subject have shown, the extraction of such parts from 2D images is far from trivial, and depends strongly on the image complexity and the similarity of the image features to the finite set of object parts.

Within the context of the recognition framework in Fig.1, 3D shapes and deformable models are believed to provide an extremely powerful indexing mechanism. Their main limitation is, however, the difficulty in extracting and learning such representations from 2D images. As a result, in modern work such 3D part-based representations are not very popular (see Table 7 for example). Nevertheless, many researchers believe that such 3D representations must play a significant role in bridging the semantic gap of recognition systems (see the Tarr-Biederman debate in Sec.2.1). As we will notice, there has been little effort in modern work for merging such 3D parts based representations with modern view-based methodologies relying on texture, local features and advanced statistical learning algorithms.

Fischler and Elschlager [107] present an early system where a reference image is represented by p components, and also associate a cost with the relative displacement/deformation in the spatial position of each component. Biederman [40] suggests that the components most appropriate for the recognition process are geons, which are generalized cones such as blocks, cylinders, wedges and cones. A maximum of 36 such geons are suggested. He argues in support of the recognition-by-components approach to vision. The author maintains that these geons are readily extractable from five detectable edge based properties of images: curvature, collinearity, symmetry, parallelism and cotermination (see Fig. 10). Biederman claims that, since these properties are invariant over viewing directions, it should be readily possible to extract geons from arbitrary images. Years of research in the field have demonstrated, however, that the extraction of geons from arbitrary images is a non-trivial task, and most likely more sources of regularization are needed if reliable extraction from images of such high-complexity objects is to be achieved. This early research by Biederman on recognition-by-components influenced significantly the computer vision community, and spurred a number of years of intense research in the field. The author argues that the human visual system recognizes a maximum of 30,000 object classes, by using an English language dictionary to approximate the number of nouns in the English language. Notice, however, that one could also argue that humans are capable of distinguishing amongst many more than 30,000 objects (millions of objects) since for each such noun, humans can effortlessly distinguish amongst many sub-classes (e.g., the noun 'car' has multiple distinguishable sub-categories which are not enumerated in a typical English language dictionary). By simple combinatorial arguments Biederman shows that combinations of 2-3 geons should be more than sufficient to provide accurate indices for recognition. A number of experiments are performed with human subjects, demonstrating the ease with which humans can recognize real life object classes that are represented by the 2D projection of a composition of 2-3 geons. Tanaka is well known for his research on uncovering the neuronal mechanisms in the inferotemporal cortex related to the representation of object images [108].

Biederman's approach is in some ways similar to Marr's [23] paradigm for object inference from 2-D images. Marr's approach, however, relies on the extraction of 3-D cylinders, rather than geons, from images. In more detail, Marr proposes three main levels of analysis in understanding vision. These include a primal sketch of a scene consisting of low-level feature extraction, followed by a 2.5D sketch where certain features which add a sense of depth might be added to the primal sketch, such as cast-shadows and textures, followed by the above described 3-D cylinder based representation of the objects in the scene.

Even though Biederman's 36 geons are inherently three-dimensional, he notes that he is not necessarily supporting an object-centered approach to recognition. He argues that since the 3D geons are specifiable from their 2D non-accidental properties, recognition does not need to proceed by constructing a three-dimensional interpretation of each volume. Note that the belief that a combination of object-centered and viewpoint-dependent recognition takes place in the human visual system is currently more widely accepted in the vision community. Biederman also argues that the recognition-by-components framework can explain why modest amounts of noise or random occlusion, such as a car occluded by foliage, do not seem to significantly affect human recognition abilities, as geon-like structures and the extraction of non-accidental properties provide sufficient regularization to the problem. Biederman is careful to indicate that he is not arguing that cues such as color, the statistical properties of textured regions, and the position of the object in the scene/context do not play a role in recognition. What he is arguing in support of is that geon-like structures are essential for primal access: the first contact/attempt at recognition that is made upon observing a stimulus and accessing our memory for recognition.


Figure 11: Examples of various generated superquadrics.

Thus, within the context of the general recognition framework of Fig.1, Biederman is proposing a potentially powerful indexing primitive. However, it is not yet clear how to reliably and consistently extract these primitives from a regular 2D image, nor is it clear what the optimal algorithm is for learning an object representation that is composed of such parts.

Pentland [109, 110] presents another parts-based approach to modelling objects using superquadrics (see Fig. 11). Let $\cos\eta = C_\eta$ and $\sin\omega = S_\omega$. Then, a superquadric is defined as

$X(\eta, \omega) = \begin{pmatrix} a_1\, C_\eta^{\varepsilon_1} C_\omega^{\varepsilon_2} \\ a_2\, C_\eta^{\varepsilon_1} S_\omega^{\varepsilon_2} \\ a_3\, S_\eta^{\varepsilon_1} \end{pmatrix}$    (4)

where X(η, ω) is a 3D vector that defines a surface parameterized by η, ω. Furthermore, ε1, ε2 are constant parameters controlling the surface shape and a1, a2, a3 denote the length, width and breadth. By deforming a number of such superquadrics, and taking their Boolean combinations, a number of solids are defined.
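
A small sketch of eq. (4) follows; it samples the superquadric surface on a grid of (η, ω) values, using sign-preserving powers so that non-integer exponents remain well defined (an implementation detail assumed here, not spelled out in the text).

```python
import numpy as np

def superquadric(a1, a2, a3, eps1, eps2, n=50):
    """Sample the surface of eq. (4) on an n x n grid of (eta, omega) values."""
    def spow(x, p):
        # signed power keeps the surface real for non-integer exponents
        return np.sign(x) * np.abs(x) ** p
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)
    omega = np.linspace(-np.pi, np.pi, n)
    eta, omega = np.meshgrid(eta, omega)
    x = a1 * spow(np.cos(eta), eps1) * spow(np.cos(omega), eps2)
    y = a2 * spow(np.cos(eta), eps1) * spow(np.sin(omega), eps2)
    z = a3 * spow(np.sin(eta), eps1)
    return x, y, z

# eps1 = eps2 = 1 gives an ellipsoid; small exponents give box-like solids.
x, y, z = superquadric(1.0, 1.0, 2.0, 0.3, 0.3)
print(x.shape, y.shape, z.shape)
```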

The authors propose a methodology for extracting such superquadrics from images, assuming the existence of accurate estimates of the surface tilt for each pixel of the image, where the tilt is defined as τ = x_n/y_n, and x_n, y_n respectively denote the x and y axis components of the surface normal. Through a simple regression methodology they show how the superquadric's center, orientation and deformation parameters can be reliably estimated. The idea is that the relatively compact description of each superquadric can provide a good methodology for indexing into a database of objects and identifying the object from the image. In practice, however, extracting such superquadrics from images has been met with little success. However, superquadrics have been successfully applied to other domains where there is significantly less variability in the image features, such as the medical imaging domain and the segmentation of the cardiac ventricles [111, 12].

Within the context of the general object recognition framework of Fig.1, the work on superquadrics that was reviewed so far has not dealt with all the modules of the standard recognition pipeline. As indicated previously, while it is difficult to extract a one-to-one map from an image of an object to a superquadric-based representation, one-to-many mappings may exist that provide a sufficiently discriminative and efficiently learnable representation [26]. Dickinson and Metaxas [112] present another superquadric based approach to shape recovery, localization and recognition that addresses to a greater extent the components of Fig.1, due to the use of a hierarchical representation of objects. They first use an aspect hierarchy to obtain segmentations of the image into likely aspects (see Fig. 12). These aspects in turn are used to guide a superquadric fitting on the 2D images. The superquadric is fit on the extracted aspects by fitting a Lagrange equation of motion

$M\ddot{q} + D\dot{q} + Kq = f$    (5)

where q is a vector containing the superquadric parameters and parameters for rotation and translation, and f is a vector of image forces which control the deformation of the differential equation. These forces depend on the extracted image aspects.


[Figure 12 content: the aspect hierarchy relates primitives to aspects, aspects to faces, and faces to boundary groups, with links indicating the possible parents at each level.]

Figure 12: The aspect hierarchy used by Dickinson and Metaxas [112].

Figure 13: A 3-D active appearance model (AAM) used in [12] to model the left ventricle of the heart. The right image shows a 3D model of the left ventricle of the heart, which captures the modes of shape deformation during the cardiac cycle as well as the corresponding image appearance/intensity variations. This model can be deformed to better fit the data in the volumetric images, and thus achieve better segmentation. The left image shows a short-axis cardiac MRI slice whose intensity is modelled by the AAM. A stack of such images produces a 3D volumetric image.

The extracted superquadrics provide a parts-based characterization of the image and also provide a compact indexing mechanism.

Sclaroff and Pentland [113] provide a different formulation of a deformable model. Given a closed parameterized curve, which represents the outline of a segmented object, they decompose the object into its so-called "modes of deformation". The model's nodes'/landmarks' displacement vector U is modelled using the Lagrange equation of motion

$M\ddot{U} + D\dot{U} + KU = R$    (6)

where R denotes the various forces acting on the model and causing the deformation — such as edges and lines — and M, D, K denote the element mass, damping and stiffness properties respectively. It is shown how to use this differential equation to obtain a basis matrix Φ of m eigenvectors Φ = [φ1, ..., φm]. Linear combinations of these eigenvectors describe the "modes of deformation" of the differential equation, thus allowing us to deform the model displacements U until the best matching model is found. Given a model object whose contour is described by a finite number of landmarks, the authors present a formulation for determining the displacement U that best matches those landmarks. If the matching is sufficiently good, we say that the object has been recognized.
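
A minimal numerical sketch of this modal decomposition follows, assuming toy mass and stiffness matrices: the eigenvectors of the generalized problem Kφ = λMφ serve as the modes, and a deformation is expressed as a linear combination of the leading modes. The matrices here are illustrative and are not derived from an actual finite-element model as in [113].

```python
import numpy as np
from scipy.linalg import eigh

# Toy mass and stiffness matrices for a model with 4 scalar displacement d.o.f.
M = np.diag([1.0, 1.0, 1.0, 1.0])
K = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])

# Generalized eigenproblem K phi = lambda M phi; the eigenvectors are the
# "modes of deformation", and low-frequency modes describe gross shape changes.
w, Phi = eigh(K, M)

# A deformation is a linear combination of the first m modes: U = Phi[:, :m] @ b
m, b = 2, np.array([0.5, -0.2])
U = Phi[:, :m] @ b
print(np.round(U, 3))
```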

Cootes et al. introduced Active Shape Models (ASMs) and Active Appearance Models (AAMs) [114], [115], which are also quite popular in the medical imaging domain [111, 12] (see Fig. 13). While superquadric based approaches use a very specific shape model, AAMs and ASMs try to learn a shape model from general data with regularization. As previously discussed, such 3D parts-based primitives offer a potentially powerful indexing mechanism, but are often difficult to extract reliably from images.
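
The shape-learning step behind ASMs can be sketched as a point distribution model: given aligned landmark vectors, the mean shape and the principal modes of variation are obtained by PCA, and plausible new shapes are generated as the mean plus a bounded combination of modes. This is a schematic sketch assuming pre-aligned shapes, not the full ASM/AAM machinery of [114], [115].

```python
import numpy as np

def point_distribution_model(shapes, var_kept=0.98):
    """shapes: (N, 2L) array of N aligned training shapes, each a concatenation
    of L landmark (x, y) coordinates. Returns mean shape, modes and variances."""
    mean = shapes.mean(axis=0)
    U, s, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
    var = s ** 2 / (len(shapes) - 1)
    m = np.searchsorted(np.cumsum(var) / var.sum(), var_kept) + 1
    return mean, Vt[:m].T, var[:m]

# New shapes are generated as x = mean + P @ b, with b bounded by the variances.
shapes = np.random.default_rng(1).normal(size=(30, 16))   # 30 shapes, 8 landmarks
mean, P, var = point_distribution_model(shapes)
b = np.zeros(P.shape[1]); b[0] = 2 * np.sqrt(var[0])       # perturb the first mode
print((mean + P @ b).shape)
```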


2.7. Function and Context

Extensive literature exists on the exploitation of prior knowledge about scene context and object function, in order to improve recognition. Function and context are related topics, since by definition, information about the function of a certain object implies that the system is able to extract some information about the scene context as well. For example, knowledge that a particular set of objects can be used as a fork, spoon and plate increases the contextual probability that we are in a kitchen and that edible objects might be close by, which can in turn help improve recognition performance. Conversely, contextual knowledge is strongly related to function since often the scene context (e.g., are we inside a house or are we outside, and what is the scale with which the scene is sensed?) could help us determine whether, for example, a car-like object is a small toy that is suitable for play, or whether it is a big vehicle that is suitable for transportation purposes.

Thus, within the context of the recognition pipeline in Fig.1 we see how function and context could in principle affect all the components of the standard pipeline. For example, contextual knowledge could place a smaller burden on the level of object representational detail required by the object database. Similarly, function and context could affect the feature extraction and grouping process when there is scene ambiguity, due to sensor noise or occlusions for example. Similarly, context and function can prune the hypothesis space, and thus improve the reliability and efficiency of the object verification and object training phases. We also notice that related work is by necessity closely related to knowledge representation frameworks.

Various knowledge representation schemes were implemented over the years in order to improve the performance of vision systems, through the integration of task-specific knowledge [116], [117], [118], [119], [120], [121], [122], [123]. Contextual knowledge used by such systems typically helps in answering certain useful questions such as: Where are we? Are we looking up or down? What kind of objects are typically located here? How will the objects in the scene be used?

An early and influential knowledge representation framework is attributable to Minsky [124]. The essence of Minsky's frame theory is encapsulated in the following quotation: "When one encounters a new situation (or makes a substantial change in one's view of the present problem) one selects from memory a structure called a Frame. This is a remembered framework to be adapted to fit reality by changing details as necessary." Frames provide a structural and concise means of organizing knowledge. Essentially, frames are data structures for encoding knowledge and represent an early attempt at modelling the way humans store knowledge. As such they have significant applications in vision and influenced early vision research on context and function. Minsky argued that frames could provide a global theory of vision. For example, he argued that frames could be used to encode knowledge about objects, sub-parts, their positions in rooms, and how these relations might change with changing viewpoint (is-a relations, part-of relations, part-whole relations and semantic relations). However, modern recognition research has moved to a learning/probabilistic based model for representing knowledge, which is typically represented in terms of graphical models [125]. In Sec. 3 we will discuss how knowledge representation frameworks were also used in early research on active object localization and recognition.

Along similar lines, research on exploiting function (a.k.a. affordances) has provided some promising results. Under this framework, the object's function plays a crucial role in recognition. For example, if we wish to perform generic recognition and be capable of recognizing all chairs, we need to identify a chair as any object on which someone can sit. One can of course argue that this is no different from the classical recognition paradigm where we learn an object's typical features, and based on those features try to recognize the object. It is simply a matter of learning all the different types of chairs. Nevertheless, and as discussed in the introduction, the huge amount of variation in objects implies that it is unreasonable to assume that an accurate geometric model will always exist.

In early work on function, actions and affordances, there was no focus on how learning is related to the problem. In more recent work the confluence of learning for actions and affordances has gained prominence [127, 128]. One can think of many object classes (such as the class of all chairs) which contain elements that, at least visually, are completely unrelated. The only intermediate feature that such classes share is their function. The idea behind learning-based affordance/function research involves associating visual features with the function of the object, and then using the object function to improve recognition.

According to Gibson's concept of affordances, the sight of an object is associated by a human being to the ways it is usable [129, 130]. It is believed that during early childhood development this association is primed by manipulation, randomly at first, and then in a more and more refined way [130].


[Figure 14 content: the system spans three layers of representation — a low level of pixel arrays (intensity, RGB, depth, from static monocular, stereo or motion imagery), an intermediate symbolic representation of regions, lines, surfaces and other tokens extracted from the sensory data, and a high level of symbolic descriptions of objects and scenes — tied together by control strategies such as focus of attention, rule-based object hypotheses, information fusion, object matching, perceptual organization (grouping, splitting and deleting tokens), segmentation, feature extraction, goal-oriented re-segmentation, and inference/propagation of belief.]

Figure 14: Overview of the VISIONS system, and the 3 main layers of representation used in the system. Adapted from Hanson and Riseman [126].

According to this school of thought, a significant reason why human object recognition is reliable is that humans immediately associate the sight of an object with its affordances, which results in strong generalization capabilities.

Recent work on affordances has also focused on its relation to robotics, by building systems that use vision to determine how an object should be grasped and manipulated [131, 132]. While some success was achieved for a small number of object classes, consistently reliable affordance-based grasping for a large number of object classes has not yet been demonstrated. Notice that object grasping is a many-to-many relationship, since multiple objects are graspable with the same grasp, and one object can be associated with multiple kinds of grasps [130].

An early example of a vision system which used a non-trivial knowledge base is the VISIONS system by Hanson and Riseman [126], which was progressively developed and improved over a number of years (also see Fig. 14). This system incorporated a knowledge representation scheme over numerous layers of representation in order to create an advanced vision system [133]. At the highest level, their system consists of a semantic network of object recognition routines. This network is hierarchically organized in terms of part-of compositional relations which, thus, decompose objects into object parts.

Stark and Bowyer [117] present the GRUFF-2 system for function based recognition and present a function-based category indexing scheme that allows for efficient indexing. Given as input a polyhedral representation of an object, they define a set of knowledge primitives which define the object. For example, a king sized bed is defined by knowledge primitives which specify the total sleeping area of the bed, the width of the sleeping area, the stability of the object — i.e., does it have a sufficient number of legs? — and so on. A hierarchy is defined for all the objects we wish to recognize. For example, a chair has a number of subcategories (conventional chair, balance chair, lounge chair, etc.) and each subcategory might have another subcategory or it might be a leaf in the hierarchy, in which case the set of knowledge primitives from the leaf to the root need to be verified to see if we are dealing with such an object. Acceptable ranges are defined for each of the tested features — such as the acceptable total sleeping area of the object if we are to classify it as a bed — and based on the total score of all the ranges, a confidence measure is defined on the hypothesized identity of the object. An indexing scheme is also proposed, which uses the results of the initial input shape estimation to remove impossible categories. For example, if the total volume of the object is not within a specified bound, we know that a big subset of the objects in our database can be ignored as they do not have the same volume bounds.
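
The range-and-score flavour of such function-based indexing can be illustrated as follows; the categories, primitives and scoring rule are invented for illustration and do not correspond to the actual GRUFF-2 knowledge base.

```python
# Each category is a conjunction of measured properties with acceptable ranges;
# the category confidence is the product of per-primitive scores.
categories = {
    "king_sized_bed": {"sleep_area_m2": (3.0, 5.0), "width_m": (1.8, 2.2), "num_legs": (4, 8)},
    "chair":          {"seat_area_m2": (0.10, 0.35), "seat_height_m": (0.35, 0.60), "num_legs": (3, 5)},
}

def primitive_score(value, lo, hi):
    """1 inside the acceptable range, decaying linearly to 0 outside it."""
    if lo <= value <= hi:
        return 1.0
    d = (lo - value) if value < lo else (value - hi)
    return max(0.0, 1.0 - d / (hi - lo))

def category_confidence(measurements, category):
    score = 1.0
    for name, (lo, hi) in categories[category].items():
        if name not in measurements:
            return 0.0                       # a required primitive could not be measured
        score *= primitive_score(measurements[name], lo, hi)
    return score

obj = {"seat_area_m2": 0.2, "seat_height_m": 0.45, "num_legs": 4}
print(category_confidence(obj, "chair"))            # 1.0
print(category_confidence(obj, "king_sized_bed"))   # 0.0 (missing primitives)
```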

Similarly, a large amount of published research exists on context-based vision. Strat and Fischler [116] present the Condor recognition system (see Fig. 15). It is a system designed for recognizing complex outdoor natural scenes, by relying to a large extent on contextual knowledge without depending on geometric models of the learned objects.


[Figure 15 content: the Condor architecture couples a core knowledge structure and context information with modules for candidate generation, candidate comparison, clique formation and clique selection, together with a user interface, 3D models, candidates, partial orders, cliques, displays and a target vocabulary.]

Figure 15: The Condor recognition system of Strat and Fischler [116].

The authors do not make certain assumptions that are inherent to many recognition systems, namely that all objects of interest are definable by a small number of parts, and that all objects have well defined locally measurable features. Context sets are defined, which are sets of predicates/image features such that, if they are satisfied, a certain action is taken. An example of a context set is {SKY-IS-CLEAR, CAMERA-IS-HORIZONTAL, RGB-IS-AVAILABLE}. If the context set is true, a certain operator is executed which helps us determine whether a certain object (soil, trees, etc.) is present in the image. Such context sets are used in 3 different types of rules:

• Type I: Candidate generation

• Type II: Candidate evaluation

• Type III: Consistency determination.

Type I rules are typically entered manually. In the candidate evaluation phase a feature selection is performed, determining which context sets are most discriminative for each operator and object/class. This makes it possible to order the rules and thus obtain, in a more efficient manner, the maximal set (referred to as the maximal clique in the paper) of objects present in the image. The maximal clique must contain objects whose rules contradict each other as little as possible. Since it is intractable to enumerate all the sets in order to determine the best one, this ordering of rules is a heuristic that makes it possible to find a good set within reasonable time.

Hoiem et al. [121] use probabilistic estimates of the 3D geometry of objects relative to other objects in the scene to make estimates of the likelihood of the various object hypotheses. For example, if a current hypothesis detects a building and a person in the image, but the extracted 3D geometry indicates that the person is taller than the building, the hypothesis is discarded as highly unlikely. Their approach can be incorporated as a "wrapper" method around any object detector. Markov random fields (MRFs) are also a popular method for incorporating contextual information via spatial dependencies in the images [134]. In more recent work, Kumar and Hebert [135] use Discriminative Random Fields (DRFs), an extension of MRFs, for incorporating neighborhood/scene interactions. The main advantage of DRFs is their ability to relax the conditional independence assumption of MRFs. A number of researchers use the statistics of bags of localized features (edges, lines, local orientation, color, etc.) to determine the likely distribution of those features depending on the scene or current context [119], [120], [123], [122]. This is often referred to as extracting the gist of the image. Recent work has examined the use of graphical models as a means of an event-based generative modelling of the objects in an image [136], [137]. For example, Murphy et al. [136] use a conditional random field for simultaneously detecting and classifying images. The authors take advantage of the correlations between typical object placements to improve recognition performance (e.g., a keyboard is typically close to a screen). Instead of modelling such correlations directly, they use a hidden common cause, which they call the "scene". In subsequent sections we discuss how such contextual clues have also been used to improve the efficiency of next-view-planners in active vision systems.
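
At its simplest, contextual reweighting can be sketched as combining an appearance likelihood with an object prior conditioned on a hypothesized scene label; this toy illustration captures the spirit of scene-conditioned context, not the actual conditional random field of [136], and all numbers are invented.

```python
# Object priors conditioned on a hypothesized scene label (illustrative values).
p_obj_given_scene = {
    "office":  {"keyboard": 0.30, "screen": 0.35, "cow": 0.01},
    "pasture": {"keyboard": 0.01, "screen": 0.02, "cow": 0.40},
}

def rescore(appearance_likelihood, scene):
    """Combine an appearance likelihood with the scene-conditioned object prior."""
    prior = p_obj_given_scene[scene]
    post = {o: appearance_likelihood[o] * prior[o] for o in appearance_likelihood}
    z = sum(post.values())
    return {o: p / z for o, p in post.items()}

# An ambiguous detector response is disambiguated by the scene context.
likelihood = {"keyboard": 0.4, "screen": 0.3, "cow": 0.3}
print(rescore(likelihood, "office"))
print(rescore(likelihood, "pasture"))
```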


Figure 16: Example of a potential transformation of an image to parametric eigenspace (from [141]); the axes e1, e2 and e3 denote eigenbasis coordinates. Each position on the manifold/contour represents the eigenbasis coordinates of an object from a certain viewpoint and illumination. Thus, if the transformed image lies close to the manifold, the image contains an instance of the object that the manifold represents.

In more recent work, the role of contextual knowledge extracted from the outputs of segmentation algorithms and local neighbourhood labelling algorithms was investigated [138, 139]. A recent evaluation of the role of context in recognition algorithms is presented in [140].

There is a general consensus in the vision community that function and context contribute significantly to the vastly superior performance of the human visual system, as compared to the performance of artificial vision systems. As discussed in [26], function and context can play a significant role in attentional priming and during the learning process of new object detectors. Edelman [27] argues that the major challenges inhibiting the design of intelligent vision systems include the need to adapt to diverse tasks, the need to deal with realistic contexts, and the need to prevent vision systems from being driven exclusively by conceptual knowledge (which effectively corresponds to template matching, as previously described). Edelman argues that the use of intermediate representations, instead of full geometric reconstruction, is a necessary condition for building a versatile artificial vision system.

2.8. Appearance Based Recognition

Early research on appearance-based recognition used global low-level image descriptors based on color and texture histograms [142]. See Niblack et al. [143], Pontil and Verri [144], and Schiele and Crowley [145] for some related early work. The introduction of appearance based recognition using Principal Component Analysis (PCA) arguably provided the first approach for reliable exemplar based recognition of objects under ideal imaging conditions (e.g., no occlusion, controlled lighting conditions, etc.). The first breakthrough in the area arose with Turk and Pentland’s “eigenfaces” paper [146], which used PCA at the level of image pixels to recognize faces. A slew of research on appearance based recognition followed. The work by Cootes et al. on Active Appearance Models [114], [115] constituted an early proof-of-concept of the applicability of such appearance based techniques to other vision-related tasks, such as tracking and medical imaging [111]. One of the first approaches using PCA at the pixel level for recognition was Murase and Nayar’s work [141]. In contrast to the traditional approach to object recognition, the recognition problem is formulated as a problem of matching appearance and not shape. PCA provides a compact representation of the object appearance parameterized by pose and illumination (see Fig. 16). For each object of interest, a large set of images is obtained by automatically varying pose and illumination. The image set is compressed to obtain a low dimensional subspace called an eigenspace, in which an object is represented as a manifold. The object is recognized based on the manifold it lies on. Every object is represented as a parametric manifold in two different eigenspaces: the universal eigenspace, which is computed using image sets of all objects imaged from all views, and the object eigenspace, which is a different manifold for each object, computed using only images/views of a single object. Given an image containing an object of interest, the authors assume the object can be segmented from the rest of the scene and is not occluded by other objects.

After a scale and brightness normalization of all the views of a certain object, they transform the image pixel intensities into a vector. They call the set of all such vectors for each view of an object the object image set, and the union of all object image sets the universal image set. The idea is that for the object image set Op of each object p, and for the universal image set U, principal component analysis (PCA) is applied and the eigenvectors explaining a certain percentage of the image variation are retained — typically 90%-95%. Using PCA, an eigenbasis for the universal image set is obtained, and a different eigenbasis for each object image set is also obtained.


For each object image set, the algorithm projects the images on the universal image set eigenbasis and on their respective object image set eigenbasis.

Notice that for each training image the algorithm knows the view from which the image was acquired. The algorithm also simulates variations in the illumination conditions of each image by using a single variable. Thus, for each object image set, the projected point is parameterized in terms of the view from which the image was acquired and in terms of the illumination conditions under which the image was acquired. By interpolation, two continuous functions are obtainable: g^(p)(θ1, θ2), the appearance of object p in the universal eigenbasis from view θ1 and illumination θ2, and f^(p)(θ1, θ2), the appearance of object p in the object eigenbasis from view θ1 and illumination θ2. The projection of an image on the universal eigenbasis gives z. The object recognition problem is reduced to the problem of finding the object p in the universal eigenbasis which gives the minimum value of min_{θ1,θ2} ||z − g^(p)(θ1, θ2)||. If we denote the projection of the image on the eigenbasis corresponding to the recognized class p as z^(p), the pose is determined by finding the θ1, θ2 which minimize ||z^(p) − f^(p)(θ1, θ2)||. The above method has a number of advantages that are shared by a large proportion of the appearance based recognition literature. It is simple, it does not require knowledge of the shape and reflectance properties of the object, it is efficient since recognition and pose estimation can be handled in real time, and it is robust to image noise and quantization. It also shares a number of disadvantages which are common in the appearance based recognition literature. It is difficult to obtain training data, since (i) there is a need to segment general scenes before the object training can happen, (ii) the method requires that the objects are not occluded, (iii) the algorithm cannot easily distinguish between two objects that differ in one small but important surface detail, and (iv) the method would not work well for objects with a high dimensional eigenspace and a high number of parameters, since the non-linear optimization problem in high dimensions is notoriously difficult. Nevertheless, despite these limitations, its training is much easier than that of manually trained systems, such as the ones in Table 1. These training limitations are shared by most modern training algorithms which require labelled and segmented data (see Sec.4). Thus the algorithm compares favourably to many of the best performing recognition systems which require detailed manual annotations.
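A compressed sketch of the matching step, under two simplifying assumptions: images are already segmented, normalized and flattened to vectors, and the continuous manifold g^(p)(θ1, θ2) is approximated by the discrete set of projected training views (the interpolation step is skipped).

```python
import numpy as np

# Sketch of appearance-based matching in a universal eigenspace. The manifold of
# each object is approximated by the projections of its training views.

def build_eigenspace(training_vectors, k):
    """PCA basis (top-k eigenvectors) of the mean-centred universal image set."""
    X = np.asarray(training_vectors, dtype=float)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]                       # mean image and k-dimensional eigenbasis

def project(vec, mean, basis):
    return basis @ (np.asarray(vec, dtype=float) - mean)

def recognize(test_vec, mean, basis, manifolds):
    """manifolds: {object_id: array of projected training views, shape (n_views, k)}."""
    z = project(test_vec, mean, basis)
    best_obj, best_dist = None, np.inf
    for obj, points in manifolds.items():
        d = np.linalg.norm(points - z, axis=1).min()   # distance to the nearest view
        if d < best_dist:
            best_obj, best_dist = obj, d
    return best_obj, best_dist
```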

A number of researchers proposed algorithms for addressing these issues. Huang et al. [147] present an approach to recognition using appearance parts. Zhou et al. [149] use particle filters in conjunction with inter-frame appearance based modelling to achieve a robust face tracker and recognizer under pose and illumination variations. Leonardis and Bischof [148] use RANSAC to handle occlusion in a more robust way. Given a basis set of eigenvectors which describe a training set of objects, the authors present a robust method of determining the coefficients of the eigenvectors which best match a target image. Using RANSAC, they randomly select subsets of the target image pixels, and find the optimal eigenvector coefficients that fit those pixels. At each iteration, the worst fitting pixels are discarded — as sources of occlusion or noise, for example — and the process is repeated. At the end, the algorithm calculates a robust estimate of the eigenvector coefficients that best fit the image. These coefficients are used to recognize the object.
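A rough sketch of the robust-coefficient idea, assuming the eigenbasis and mean image are already available; the subset size, iteration count and inlier threshold are illustrative choices, not those of Leonardis and Bischof.

```python
import numpy as np

# Rough sketch of robustly estimating eigenspace coefficients from random pixel
# subsets, in the spirit of hypothesize-and-test (RANSAC-like) fitting.
# `basis` has shape (k, d): k eigenvectors over d pixels; subset must be <= d.

def robust_coefficients(image_vec, mean, basis, n_iter=200, subset=100, thresh=10.0):
    x = np.asarray(image_vec, dtype=float) - mean
    k, d = basis.shape
    best_coeffs, best_inliers = None, -1
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(d, size=subset, replace=False)
        # least-squares coefficients explaining only the sampled pixels
        coeffs, *_ = np.linalg.lstsq(basis[:, idx].T, x[idx], rcond=None)
        residual = np.abs(basis.T @ coeffs - x)
        inliers = int((residual < thresh).sum())
        if inliers > best_inliers:            # keep the hypothesis with most inliers
            best_coeffs, best_inliers = coeffs, inliers
    return best_coeffs
```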

Thus, within the context of the pipeline in Fig.1, we see that appearance based approaches are strongly related to the feature grouping module of recognition algorithms. Their power lies in the projection of the raw extracted features to a lower dimensional feature space which is more easily amenable to powerful classifiers. This demonstrates how powerful feature grouping algorithms led to some of the earliest reliable exemplar-based recognition systems. In Sec.3 we show how active vision approaches, in conjunction with dimensionality reduction algorithms, were proven capable of leading to a drastic reduction in the size of the object database (see Fig.1) needed for acceptable recognition performance.

2.9. Local Feature-Based Recognition and Constellation Methods

Local feature-based recognition methods gained popularity in the second half of the 1990’s, mainly due to their robustness to clutter and partial occlusion [150], [151], [72], [152], [153], [154], [155]. Inspired by the machine learning literature and the introduction of promising new classifiers (SVMs for example) that could now run within reasonable time frames on personal computers, researchers started investigating methods for extracting features from images and applying machine learning techniques to identify the likely object from which those features were extracted.

Local-based features are useful for local or global recognition purposes. Local recognition is useful when we want the ability to recognize the identity and location of part of an object. Local-based features are also useful for global recognition purposes, when we are not interested in the exact location of the object in an image, but we are just interested in whether a particular image contains a particular object at any image location (classification). Such global features are particularly popular with Content Based Image Retrieval (CBIR) systems, where the user is typically interested in extracting global image properties/statistics (see Sec. 2.12). As discussed in more detail in Sec.4, image categorization/classification algorithms (which indicate whether an image contains an instance of a particular object class) are significantly more reliable than object localization algorithms, whose task is to localize (or segment) from an image all instances of the object of interest. Good localization performance has been achieved for restricted object classes; in general there still does not exist an object localization algorithm that can consistently and reliably localize arbitrary object classes. In Chart 5 and Table 3 we present a comparison, along certain dimensions, for a number of the papers surveyed in Secs.2.9-2.10.

Papers (1973-2011) | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Lamdan et al. [89] | ** | **** | ** | *** | ** | * | * | *
Forsyth et al. [92] | ** | ** | ** | **** | *** | * | ** | *
Flynn and Jain [93] | ** | ** | * | *** | ** | * | *** | *
Fischler and Elschlager [107] | * | ** | * | **** | ** | * | * | *
Biederman [40] | * | * | * | **** | *** | * | *** | *
Marr [23] | * | * | * | *** | *** | * | *** | *
Pentland [109] | * | * | * | **** | *** | * | *** | *
Sclaroff and Pentland [113] | ** | * | ** | *** | ** | * | * | *
Grabner et al. [127] | * | * | * | *** | * | **** | **** | *
Stark et al. [128] | ** | ** | ** | ** | ** | **** | * | *
Castellini et al. [130] | ** | ** | ** | ** | ** | **** | * | *
Ridge et al. [131] | ** | ** | ** | ** | ** | **** | *** | *
Saxena et al. [132] | ** | ** | ** | ** | ** | *** | *** | *
Hanson and Riseman [126] | ** | ** | ** | ** | *** | *** | *** | ***
Stark and Bowyer [117] | * | * | * | *** | ** | **** | *** | *
Strat and Fischler [116] | * | * | * | * | *** | *** | **** | ***
Hoiem et al. [121] | *** | ** | ** | ** | *** | *** | *** | *
Kumar and Hebert [135] | *** | ** | ** | ** | ** | ** | * | *
Torralba et al. [119] | *** | ** | ** | ** | ** | *** | * | *
Torralba [120] | *** | ** | ** | ** | ** | *** | * | *
Wolf and Bileschi [123] | ** | ** | ** | ** | ** | ** | * | *
Li and Fei [137] | ** | ** | ** | ** | ** | *** | * | *
Murphy et al. [136] | *** | ** | ** | ** | ** | *** | * | *
Shotton et al. [138] | *** | ** | *** | *** | ** | *** | * | *
Heitz and Koller [139] | *** | ** | *** | ** | *** | *** | * | ***
Turk and Pentland [146] | ** | ** | *** | *** | ** | * | * | *
Murase and Nayar [141] | ** | ** | *** | *** | ** | * | * | *
Huang et al. [147] | ** | ** | ** | *** | ** | * | * | *
Leonardis and Bischof [148] | ** | ** | ** | ** | ** | * | * | *

Table 2: Comparing some of the more distinct algorithms of Secs.2.5-2.8 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*,**,***,****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influence the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.

An early local feature-based approach to recognition is Rao and Ballard’s iconic representation algorithm [150], which extracts local feature vectors encoding the multiscale local orientation of various image locations. These vectors are used for recognition purposes. Lowe presents the SIFT algorithm [72], where a number of interest points are detected in the image using difference-of-gaussian like operators. At each one of those interest points, a feature vector is extracted. Over a number of scales and over a neighborhood around the point of interest, the local orientation of the image is estimated using common techniques from the literature. Each local orientation angle is expressed with respect to the dominant local orientation, thus providing rotation invariance. If a number of such features are extracted from an object’s template image, we say that the object is detected in a new test image if a number of similar feature vectors are localized in the new test image at similar relative locations. Quite often such features are used as elements of orientation histograms. A comparison of the similarity between two such histograms helps determine the similarity between two shapes. See [156, 157] for some early precursors of such approaches. Currently such approaches are extremely popular in addressing the image classification problem, and we will discuss them in more detail in Sec. 4. Thus, within the context of the recognition pipeline in Fig.1, we see that early work on local features was most closely related to the feature grouping module. In Sec.2.12 we will see why such features are also useful in reducing the object database storage requirements of content based image retrieval systems.

Mikolajczyk and Schmid present an affine invariant interest point detector [158] that provides a feature grouping procedure that is more robust under certain image transformations, and can thus improve the reliability of the recognition modules in Fig.1 that depend on the feature grouping module. The local features are robust in the presence of affine transformations and changes in scale, thus providing invariance under viewpoint changes. Interest point candidates are first extracted using the multi-scale Harris detector. Then, based on the gradient distribution of the interest point’s local neighborhood, an affine transformation is estimated that makes the local image gradient distribution isotropic and that corrects the displacements in the interest point locations across scales due to the Harris detector. Once these isotropic neighborhoods are obtained, any typical feature based recognition approach could be used. The authors estimate a vector of local image derivatives of dimension 12, by estimating derivatives up to 4th order. A simple Mahalanobis comparison formulates hypotheses of matching interest points across two images, and a RANSAC based approach further refines the correspondences and provides a robust estimate of a homography describing the transformation between the two images. Using this homography, if there is a sufficient number of matching interest points, the two images match, potentially recognizing the object in one image if the object present in the other image is known. Similar approaches are described elsewhere for affine invariant matching, wide-baseline stereo and multiview recognition [159], [160], [161], [162], [163], [164]. Local image-based features are also used for vision based localization and mapping with some success [165], [166], [167] and are currently quite popular in content-based image retrieval [168].

Belongie et al. [155] present a metric for shape similarity and use this to find correspondences amongst shapes that are describable by well defined contours, such as letters and logos. Each shape’s outline is discretized into a number of points, and for each point a log-polar histogram is built of the locations of the other shape points with respect to this point (see Fig. 17). To compare the correspondence quality of two points p_i, q_j on two different objects, it suffices to compare their respective histograms using the χ² test statistic C_{ij} ≡ C(p_i, q_j) = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / (h_i(k) + h_j(k)), where h_i(k) denotes the kth entry in the histogram of the relative coordinates of the points on the shape contour, where the relative coordinates are expressed with respect to the ith point on the shape contour. A bipartite graph matching algorithm then searches for the best matches amongst all the landmarks on the two shapes.
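A direct transcription of the χ² matching cost above into numpy; the histograms are assumed to be the already-built log-polar shape contexts of the two point sets.

```python
import numpy as np

# Chi-square matching cost between two sets of shape-context histograms.
# H1[i], H2[j] are the log-polar histograms of point p_i on the first shape and
# q_j on the second shape, each with K bins.

def shape_context_costs(H1, H2, eps=1e-12):
    H1 = np.asarray(H1, dtype=float)         # shape (n1, K)
    H2 = np.asarray(H2, dtype=float)         # shape (n2, K)
    diff = H1[:, None, :] - H2[None, :, :]   # (n1, n2, K)
    summ = H1[:, None, :] + H2[None, :, :] + eps
    return 0.5 * np.sum(diff ** 2 / summ, axis=2)   # cost matrix C[i, j]
```

The resulting cost matrix C[i, j] can then be handed to a bipartite matching routine (for example the Hungarian algorithm) to obtain the point correspondences mentioned above.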

Figure 17: Belongie’s algorithm [155]. The contour outline of an object is discretized into a number of points, which are in turn mapped onto a log-polar histogram.

Chart 5: Summary of the 1995-2012 papers from Table 3. We notice that the local feature-based representation, constellation method, grammar and graph representation papers surveyed in the respective sections are mostly focused on inference and training efficiency, encoding scalability, and expanding the diversity of the indexing primitives used. There is no consistent effort amongst this group of papers to simultaneously model 3D representations, texture and function/context.

Csurka et al. [169] and Sivic and Zisserman [170] introduced the “bag-of-features” approach for recognition, an influential and efficient approach which was widely adopted by the community. The main advantages of the framework are its simplicity, efficiency and invariance under viewpoint changes and background clutter, which typically result in good image categorization. The framework has four main steps: (i) detection of image patches, (ii) assignment of each patch, via its descriptor, to one of a set of mined clusters, (iii) counting the number of keypoints/features assigned to each cluster, and (iv) treating the resulting bag of features as a feature vector and using a classifier to classify the respective image.
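The four steps map almost directly onto code. Below is a minimal sketch, assuming the local descriptors (e.g. SIFT-like vectors) are already extracted; the vocabulary comes from a plain k-means implementation and the normalized histogram is what would be handed to a classifier.

```python
import numpy as np

# Minimal bag-of-features sketch: descriptors are assumed to be precomputed;
# the "visual vocabulary" is simply a set of k-means centroids.

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bag_of_features(descriptors, centers):
    """Step (ii): assign each descriptor to its nearest visual word; step (iii): count."""
    labels = np.argmin(((descriptors[:, None, :] - centers[None]) ** 2).sum(2), axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)       # normalized feature vector for a classifier
```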

Grauman and Darrell [171] use a “bag-of-features” type of approach to recognition. They extract SIFT features from a set of images and then define a pyramid-based metric which measures the affinity between the features in any two images. A spectral clustering based approach clusters the images based on this affinity, providing a semi-supervised method for determining classes from a set of images. Each cluster is then further refined, removing any potential outliers from the clusters.

Papers (1995-2012) | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Rao and Ballard [150] | * | ** | ** | ** | ** | * | * | *
Lowe [72] | ** | ** | ** | ** | ** | * | * | *
Mikolajczyk and Schmid [158] | *** | ** | ** | ** | *** | * | * | *
Belongie et al. [155] | *** | ** | ** | ** | *** | * | * | *
Csurka et al. [169] | *** | ** | ** | ** | *** | ** | * | *
Grauman and Darrell [171] | ** | ** | **** | *** | *** | * | * | *
Nister and Stewenius [75] | **** | *** | *** | *** | *** | * | * | *
Sivic and Zisserman [170] | *** | ** | N/A | ** | *** | * | * | *
Kokkinos and Yuille [172] | *** | ** | *** | ** | **** | * | * | *
Lampert et al. [173] | ** | **** | ** | ** | ** | * | * | *
Fergus et al. [100] | ** | ** | **** | ** | *** | * | * | *
Fergus et al. [174] | ** | ** | **** | ** | *** | * | * | *
Sivic et al. [175] | *** | ** | **** | ** | *** | ** | * | *
Ullman et al. [176] | ** | ** | ** | *** | *** | * | * | *
Felzenszwalb and Huttenlocher [177] | ** | *** | *** | **** | ** | ** | * | *
Leibe and Schiele [178] | ** | ** | ** | *** | ** | * | * | *
Li et al. [179] | *** | ** | ** | *** | ** | * | * | *
Ferrari et al. [180] | *** | ** | ** | ** | *** | * | * | *
Siddiqi et al. [181] | ** | * | ** | **** | ** | * | * | *
LeCun et al. [182] | *** | ** | ** | ** | *** | * | * | *
Ommer et al. [183] | ** | ** | ** | *** | *** | * | * | *
Ommer and Buhmann [184] | ** | ** | ** | *** | *** | * | * | *
Deng et al. [185, 186] | **** | *** | *** | **** | *** | ** | * | ***
Bart et al. [187, 188] | *** | ** | **** | *** | *** | ** | * | ***
Le et al. [189] | *** | ** | **** | *** | *** | * | * | *

Table 3: Comparing some of the more distinct algorithms of Secs.2.9-2.10 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*,**,***,****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influence the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.

Another bag-of-features type of approach is presented by Nister and Stewenius [75], where the authors present a recognition scheme that scales well to a large number of objects. They present an online test suite using a 40,000 image database of music CD covers. The authors also tested their system using a set of 6376 labelled images that were embedded in a dataset of around 1 million frames captured from various movies. Features are extracted by using the Maximally Stable Extremal Region algorithm [164] to locate regions of interest, followed by fitting an ellipse to each such region and transforming each ellipse into a circular region. SIFT features are then extracted from these normalized regions and are quantized using a vocabulary tree algorithm. Effectively, the vocabulary tree uses hierarchical k-means to create a feature tree, where at each layer of the tree the features are grouped into k subtrees. Each node of the tree is assigned an information theoretic weight w_i = ln(N/N_i), where N is the number of images in the database and N_i is the number of training images with at least one quantized vector passing through node i in the tree. A query image is matched with the database images by extracting all the feature vectors from the query image, and then finding the path in the tree that best matches each feature vector. Each node i of the tree is weighted by the number of query image vectors that traverse the corresponding node i, and this provides a vector which is matched with each database image vector. The matching provides the vector’s “distance” to the closest matching database images. It is surprising that the algorithm gives good performance despite the fact that information about the relative position of the various features is discarded. This reinforces the point discussed elsewhere in this survey, that detection algorithms which do not attempt to actually localize the position of an object in an image tend to perform better than localization algorithms. Thus, within the context of the pipeline in Fig.1, we see that this work makes a proposal on how local features could improve the feature grouping, object hypothesis and object verification phases, compared to a baseline feature based approach.
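A flattened sketch of the scoring idea: here every "node" is simply a visual word weighted by w_i = ln(N/N_i), and images are compared as normalized, weighted count vectors. The actual system weights every node of a hierarchical k-means tree and scores the tree paths traversed by the query features, so this is only the single-level special case.

```python
import numpy as np

# Vocabulary-tree style scoring, flattened to a single level of visual words.
# db_histograms: (N_images, n_words) word counts; query_hist: (n_words,) counts.

def word_weights(db_histograms):
    db = np.asarray(db_histograms, dtype=float)
    N = db.shape[0]
    Ni = np.maximum((db > 0).sum(axis=0), 1)        # images containing each word
    return np.log(N / Ni)                           # w_i = ln(N / N_i)

def rank_database(query_hist, db_histograms, weights):
    q = np.asarray(query_hist, dtype=float) * weights
    q = q / max(np.linalg.norm(q, 1), 1e-12)        # L1-normalized weighted vector
    dists = []
    for d in np.asarray(db_histograms, dtype=float):
        d = d * weights
        d = d / max(np.linalg.norm(d, 1), 1e-12)
        dists.append(np.linalg.norm(q - d, 1))      # smaller distance = better match
    return np.argsort(dists)                        # database indices, best first
```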

Sivic and Zisserman [170] present a method inspired by the text retrieval literature for detecting objects and scenes in videos. A number of affine invariant features are extracted and an index is created using those features. A Mahalanobis distance metric is used to cluster these features into “visual words”, or frequent features. Those visual words are used to achieve recognition. For a given image/document, each visual word is assigned a weight of importance which depends on the product of the frequency of the word in the document with another term which downplays words that appear frequently in the database. Given a query vector of the visual words in an image/video sequence, and a set of visual word vectors with their weights extracted from the database of videos, a matching score is based on the scalar product of the query vector with any database vector. The authors also discuss various ways in which the weights could affect the matching score/ranking of images/videos.

Kokkinos and Yuille [172] present scale invariant descriptors (SIDs) and use these descriptors as the basis of an object recognition system that detects whether certain images contain cars, faces or background texture. The authors use a logarithmic sampling (centered at each pixel of the image) that is similar to the human visual front-end. As a result the image region around each pixel is parameterized in terms of a logarithmically spaced radius r and a rotation value u. The authors show that as a result of the non-uniform scale of spatial sampling, it is possible to obtain feature vectors that are scale and rotation invariant. These feature vectors depend on the amplitude, orientation and phase at each corresponding image position. They are obtained by transforming the corresponding amplitude, orientation and phase maps of each image into the Fourier domain, resulting in orientation and scale invariance. These feature vectors are in turn used as the basis of an object detector: the authors describe a methodology for extracting candidate sketch tokens from training images and describing their shape and appearance distributions in terms of SIDs, which in turn enables object detection to take place.

Lampert shows how a branch and bound algorithm can be used in conjunction with a bag-of-visual-words model in order to achieve efficient image search, by circumventing the sliding window approach that has dominated much of the literature [190, 191, 173, 192]. The algorithm is able to localize the target of interest in linear or sublinear time. The authors also show how classifiers, such as SVMs, which were considered too slow for similar localization tasks, can be used within this framework for efficient object localization. This resulted in significant efficiency improvements in the hypothesis generation phase of the algorithm (see Fig.1), which contributed to its popularity.
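A sketch of the branch-and-bound search skeleton, under the simplifying assumption that the classifier's response decomposes into a per-pixel score map (positive scores support the object, negative scores oppose it), so that an upper bound for a set of rectangles is the positive mass of the largest rectangle in the set plus the negative mass of the smallest one; Lampert et al. derive tighter, classifier-specific bounds, but the search structure is the same.

```python
import heapq
import numpy as np

# Branch-and-bound subwindow search over a per-pixel score map. Rectangle *sets* are
# interval boxes [t1,t2]x[b1,b2]x[l1,l2]x[r1,r2] over top/bottom/left/right.

def _integral(img):
    I = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    I[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return I

def _rect_sum(I, t, b, l, r):            # sum over rows t..b, cols l..r (inclusive)
    return I[b + 1, r + 1] - I[t, r + 1] - I[b + 1, l] + I[t, l]

def ess(score_map):
    H, W = score_map.shape
    Ipos = _integral(np.maximum(score_map, 0))
    Ineg = _integral(np.minimum(score_map, 0))

    def bound(t1, t2, b1, b2, l1, l2, r1, r2):
        if b2 < t1 or r2 < l1:            # even the largest rectangle is empty
            return -np.inf
        up = _rect_sum(Ipos, t1, b2, l1, r2)            # positive mass, largest rect
        if t2 <= b1 and l2 <= r1:
            up += _rect_sum(Ineg, t2, b1, l2, r1)       # negative mass, smallest rect
        return up

    state = (0, H - 1, 0, H - 1, 0, W - 1, 0, W - 1)
    heap = [(-bound(*state), state)]
    while heap:
        neg_b, (t1, t2, b1, b2, l1, l2, r1, r2) = heapq.heappop(heap)
        intervals = [(t1, t2), (b1, b2), (l1, l2), (r1, r2)]
        widths = [hi - lo for lo, hi in intervals]
        if max(widths) == 0:              # a single rectangle: this is the optimum
            return (t1, l1, b1, r1), -neg_b
        i = int(np.argmax(widths))        # split the widest interval in half
        lo, hi = intervals[i]
        mid = (lo + hi) // 2
        for new in [(lo, mid), (mid + 1, hi)]:
            cand = list(intervals)
            cand[i] = new
            flat = tuple(v for pair in cand for v in pair)
            heapq.heappush(heap, (-bound(*flat), flat))
    return None, -np.inf
```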

Local feature-based approaches have been successfully applied to a number of tasks in computer vision. However, it is generally acknowledged that significantly more complex types of object representations are necessary to bridge the semantic gap between low level and high level representations, which we discussed in Secs.1, 2.1. It is important to keep in mind that some recent work questions whether popular object representations and recognition algorithms do indeed offer superior performance, as compared to other much simpler algorithms, or whether this difference in performance is usually just an artifact of biased datasets (see Pinto et al. [193], Torralba and Efros [194], and Andreopoulos and Tsotsos [73]). As discussed in the above papers, the empirical evidence clearly points to the fact that a common thread of most recognition algorithms is their fragility and their inability to generalize in novel environments. This is an indication that there is significant room for breakthrough innovations in the field. While local features are the only thing that can be observed reliably, there is a significant on-going discussion on the situations when these local features need to be tied together to obtain more complex representations [51]. This topic of local features vs. scene representation tends to re-emerge within the context of the semantic gap problem.


For example, when dealing with commercial vision systems that are expected to mine massive datasets using a large number of object classes (see Sec. 2.12 on content based image retrieval systems), there is an inverse relationship between an increase in the system’s efficiency and its reliability: as the complexity of the extracted features and the constructed scene representations increases, the computational resources and the amount of training data required can easily become intractable [26].

Constellation methods are “parts-and-structure” models for recognition that lie at the intersection of appearance/local-feature-based methods and parts-based methods. They represent an attempt to compensate for the simplicity of local-feature-based representations, by using them to form dictionaries of more complex and compact object representations [195, 196, 179, 177, 176, 197, 178, 180, 198, 199, 100, 174, 200, 201, 202, 203]. As such they represent an evolution of the local-feature-based grouping phase of the pipeline in Fig.1. Within this context, the work by Fischler and Elschlager [107], which was previously discussed, could also be classified as falling within this category since it relies on a parts-based representation of objects. In early work on recognition, when referring to parts-based approaches, authors were often referring to 3-D object parts (superquadrics, cylinders, deformable models, etc.), while more recent local-feature-based approaches are mostly used to form 2-D parts-based representations of objects. In general, sophisticated learning techniques have been applied to a much greater extent to local-feature-based object representations and constellation methods. This differentiates much of the literature on 3-D and 2-D parts representations of objects.

An advantage of many constellation methods is that they are learnt from unsegmented training images, and the only supervised information provided is that the image contains an instance of the object class [174]. It is not always necessary for precise object localization information to be provided a priori, of course. However, the less extraneous/background scene information present in the training images, the better the resulting classifier. Typically, this is achieved through latent variable estimation using the EM algorithm. A disadvantage of such approaches is that their training can sometimes be quite expensive. For example, many formulations of constellation methods typically require fully connected graphs, where the graph nodes might represent local features or parts. As a means of simplifying such problems, authors often use various heuristics to decrease the connectivity of the related graphs or to simplify other aspects of the problem. The published literature does not tend to distinguish its recognition algorithms as exemplar or generic. The spectrum of object categories encountered in the literature is as general as that of cars and as specific as that of cars with a particular pattern of stripes. As is common in the literature, typically only successful approaches are published, making it difficult to understand why a particular approach that works well in one situation might not work so well in another. We discuss this topic in more detail in Sec.4.

Fergus, Perona and Zisserman are arguably some of the strongest advocates of the approach and have published a series of papers on constellation methods, some of which we overview here [100], [195], [174], [200], [203]. Their work is an early example of view-based approaches combined with a graphical model based learning and representation framework. Within the context of Fig.1, their papers represent a characteristic example of an effort to use low level indexing primitives to learn progressively more complex primitives (so called words). In [100] the shape, appearance, relative scale of the parts, and potential occlusion of parts is modelled. For each image pixel and over a range of image scales the local saliency is detected, and the regions whose saliency is above a certain threshold are the regions from which the features used for recognition are extracted. The saliency metric is a product of the image intensity histogram’s entropy over an image radius determined by the scale, weighed by the sum over all image intensities of the rate of change of the corresponding intensity channel as the scale varies. The training is completely unsupervised, which is the main strength of the paper. However, the method’s training is extremely slow, as a 6-7 part model with 20-30 features, using 400 training images, takes 24-36 hours to complete on a Pentium 4 PC. A number of short-cuts have been proposed that improve the training times. The number of parts P is specified a priori. To each part, the algorithm assigns a feature out of the N features in the image. The features not assigned to a part are classified as belonging to the background, and are therefore irrelevant. The object’s shape is represented as a Gaussian distribution of the features’ relative locations, and the scale of each part with respect to a reference frame is also modelled by a Gaussian distribution. Each part’s appearance is modelled as a 121-dimensional vector whose dimension is further decreased by applying PCA on the set of all such 121-dimensional vectors in the training set. A Gaussian distribution is then used to model each part’s appearance. As is common with such constellation methods, an EM algorithm is applied to determine the unknown parameters (shape mean, shape covariance matrix, each part’s scale parameters, part occlusion modelling, appearance mean and covariance matrix). As the E-step of the EM algorithm would need to search through an exponential number of parameter assignments (O(N^P)), the A* search algorithm is applied to improve the training complexity. Once the training is complete, the decision as to whether a particular object class O is present in the image is made by maximizing the ratio of probabilities p(O|parameters)/p(¬O|parameters), where parameters denotes all the parameters estimated during training with the EM algorithm. Fei-Fei Li et al. have picked up on this work and published numerous related papers. In Li et al. [197] for example, the authors use an online version of the EM algorithm so that the model learning is not done as a batch process.

Figure 18: (left) A fully connected graphical model of a six-part object. (right) A star model of the same object, as proposed by Fergus et al. [174].

Fergus et al. [195] extend their approach by also encoding each part by its curve segments. A Canny edge operator determines all the curves, and each curve is split into independent segments at its bi-tangent points. To each such curve a similarity transformation is applied so that the curve starts at the origin and ends at (1,0). The curve endpoint positioned at the origin is determined by whether or not its centroid falls beneath the x-axis. By evenly sampling each curve at 15 points along its x-axis, a 15-dimensional feature vector of the curve is obtained, which is modelled by a 15-dimensional Gaussian. The model is again learnt via the EM algorithm. The training data set used contains valid data of the object we wish to learn, but it might also contain irrelevant background images. RANSAC is used to fit a number of models and determine the best trained model. By applying each learned model on two datasets — one containing exclusively background/irrelevant data and the other containing many correct object instances — the best model is chosen based on the idea that the best model’s scoring function p(O|parameters)/p(¬O|parameters) should be the lowest on the background data and the highest on the data with valid object instances. The algorithm is used to learn object categories from images indexed by Google’s search engine.

In [200] the authors use the concept of probabilistic Latent Semantic Analysis (pLSA) from the field of textual analysis to achieve recognition. If we have D documents/images and each document/image has a maximum of W words/feature types in it, we can denote by n(w, d) the number of words of type w in document d. If z denotes the topic/object, the pLSA model maximizes the log likelihood of the model over the data:

L = \prod_{d=1}^{D} \prod_{w=1}^{W} P(w, d)^{n(w,d)}    (7)

where

P(w, d) = \sum_{z=1}^{Z} P(w|z) P(z|d) P(d)    (8)

and Z is the total number of topics/objects. Again the EM algorithm is used to estimate the latent variables and learn the model densities P(w|z) and P(z|d). Recognition is achieved by estimating P(z|d) for the query images.
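A compact EM sketch for the pLSA model of Eqs. (7)-(8), operating on a word-document count matrix; P(d) is left implicit (it is fixed by the document lengths), and initialization and stopping criteria are deliberately simplistic.

```python
import numpy as np

# EM for pLSA on a word-document count matrix n of shape (W, D); learns P(w|z) and P(z|d).

def plsa(n, Z, n_iter=50, seed=0):
    W, D = n.shape
    rng = np.random.default_rng(seed)
    Pw_z = rng.random((W, Z)); Pw_z /= Pw_z.sum(axis=0, keepdims=True)   # P(w|z)
    Pz_d = rng.random((Z, D)); Pz_d /= Pz_d.sum(axis=0, keepdims=True)   # P(z|d)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|w,d) for every word-document pair
        joint = Pw_z[:, :, None] * Pz_d[None, :, :]           # shape (W, Z, D)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate the model densities from expected counts
        weighted = n[:, None, :] * joint                      # n(w,d) * P(z|w,d)
        Pw_z = weighted.sum(axis=2)
        Pw_z /= Pw_z.sum(axis=0, keepdims=True) + 1e-12
        Pz_d = weighted.sum(axis=0)
        Pz_d /= Pz_d.sum(axis=0, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```

At query time, P(z|d) for a new image would be re-estimated with the learned P(w|z) held fixed, and the dominant topic z taken as the recognized category.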

In [174] the authors address the previously mentioned problem of a fully connected graphical model representing all the possible parts-features combinations. By using a star model to represent the probability distributions, the complexity is reduced to O(N^2 P). In the star model (see Fig. 18) all the other object parts are described with respect to a landmark part. If the position of the non-landmark parts is expressed with respect to the position of the landmark part, translation invariance is also obtained. The authors also obtain scale invariance by dividing each non-landmark part’s location by the scale of the landmark part. The rest of the approach is similar to [100].

Sivic et al. [175] present a method based on pLSA for detecting object categories from sets of unlabelled images. The same objective function as in Eq.(7) is used, where now the EM algorithm is used to maximize the objective function and discover the topics/object classes corresponding to a number of features. The features used by the authors are SIFT-like feature vectors. Two types of affine covariant regions are computed in each image, using various methods described in the literature. One method is based on [158], which we described above. For each such elliptical region a SIFT-like descriptor is calculated. K-means clustering is applied to these SIFT descriptors to determine the “words” comprising the data set. The authors demonstrate that even though this is a “bag of words” type of algorithm, it is feasible to use the algorithm for localizing/segmenting an object in an image. The authors demonstrate that doublets of features can be used to accomplish this.

Figure 19: The four types of shocks, as described by Siddiqi et al. [181]: first-order (protrusion), second-order (neck), third-order (bend) and fourth-order (seed).

Ullman et al. [176] present an approach that uses a constellation of face-part templates (eyes, mouth, nose, etc.) for detecting faces. The templates are selected from a training set using an information maximization based approach, and detection is achieved by selecting the highest scoring image fragments under the assumption that the object is indeed present in the image.

Felzenszwalb and Huttenlocher [177] present a recognition algorithm based on constellations of iconic feature representations [150] that can also recognize articulated objects. The work is motivated by the pictorial structure models first introduced by Fischler and Elschlager [107]. The authors use a probabilistic formulation of deformable parts which are connected by spring-like connections. The authors indicate that this provides a good generative model of objects, which is helpful for generic recognition problems. The authors test the system on person-tracking tasks.

Figure 20: The shock trees of two objects (top row) and the correspondences between the trees and the medial axes of two views of an object (bottom row). Adapted from Siddiqi et al. [181].

Leibe, Schiele and Leonardis [178], [198], [199] model objects using a constellation of appearance parts to achieve simultaneous recognition and segmentation. Image patches of 25 × 25 pixels are extracted around each interest point detected using the Harris interest point detector. Those patches are compared to codebook entries of patches which were discovered by agglomerative clustering on a codebook of appearance patches of an object of interest. The similarity criterion is based on the normalized grey-scale correlation. From each such cluster, its center is selected as the representative patch. Once an image patch is matched to a codebook entry, that codebook entry casts votes for the likely objects it might have come from and places a vote in the image for the object center relative to the object patch. This voting mechanism is used to select the most likely object identity. The likely object is backprojected onto the image, and this provides a verification and segmentation of the object. In [198] this work is extended to achieve a greater amount of scale invariance. Opelt et al. [196] use a similar approach, except that instead of using patches of appearance, they use pairs of boundary fragments extracted using the Canny operator in conjunction with an Adaboost based classification. Li et al. [179] present what amounts to a feature selection algorithm for selecting the most meaningful features in the presence of a large number of distracting and/or irrelevant features. Ferrari et al. [180] present a method for recognition which initially detects a single discriminative feature in the image and, by exploring the image region around that feature, slowly grows the set of matching image features.
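The voting mechanism of Leibe et al. described above can be sketched as follows, assuming the patch-to-codebook matches and the per-entry offsets to the object centre (learned during training) are already available; patch extraction, clustering and the back-projection/segmentation step are omitted.

```python
import numpy as np

# Implicit-shape-model style voting for the object centre. Each matched codebook
# entry contributes its stored centre offsets as weighted votes in an accumulator;
# the strongest accumulator cell is taken as the detected centre.

def vote_for_center(matches, image_shape, cell=4):
    """matches: list of ((patch_x, patch_y), list_of_(dx, dy)_offsets, match_weight)."""
    H, W = image_shape
    acc = np.zeros((H // cell + 1, W // cell + 1))
    for (px, py), offsets, w in matches:
        for (dx, dy) in offsets:
            cx, cy = px + dx, py + dy
            if 0 <= cx < W and 0 <= cy < H:
                acc[int(cy) // cell, int(cx) // cell] += w
    iy, ix = np.unravel_index(np.argmax(acc), acc.shape)
    return (ix * cell + cell // 2, iy * cell + cell // 2), acc.max()
```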

From this survey on local features and constellation methods, we see that most research efforts in the field have been applied to the feature grouping phase of the pipeline in Fig.1. In Sec.3 we will discuss a number of active recognition approaches which, in conjunction with local feature based approaches and constellation methods, form an alternative framework for viewpoint selection, object hypothesis formation and object verification (Fig.1).

2.10. Grammars and Related Graph Representations

An often encountered argument in linguistics refers to the need to use a sparse set of word representations in any given language, as a means of ensuring redundancy and efficient communication despite the existence of potentially ambiguous basic speech signals [204, 205]. As first argued by Laplace [206], out of the large set of words that could be formulated by taking random, finite-length arrangements of the letters in any popular alphabet (such as the Latin or Greek alphabets), it is this sparsity of chosen words, and the familiarity associated with some subset of the words, that makes a valid word stand out as a non-random arrangement of letters.

This has motivated the vision community to conduct research into the use of grammars as a means of compactly encoding the fact that certain parts of an image tend to occur in unison more often than at random. This in turn facilitates the construction of compact representations, with all the associated benefits [26]. Thus, grammars provide a formalism for encoding certain recurring ideas in the vision literature, such as using 2-D and volumetric parts for constructing compact object representations, as we have discussed earlier. As we will demonstrate in this section, the parse trees associated with a particular grammar provide a simple graph-based formalism for matching object representations to parsed-image representations, and for localizing objects of interest in an image. It is important to point out that in practice, the published literature does not tend to distinguish the recognition algorithms as being exemplar or generic. An early identification of the task and scope that a particular algorithm is meant to solve can affect the graph based recognition architecture used. Thus, within the context of the pipeline in Fig.1, grammars are meant to offer a compact, redundant and robust approach to feature grouping.


More formally, a grammar consists of a 4-tuple G = (V_N, V_T, R, S), where V_N, V_T are finite sets of non-terminal and terminal nodes respectively, S is a start symbol, and R is a set of functions referred to as production rules, where each such function is of the form γ : α → β for some α, β ∈ (V_N ∪ V_T)^+. The language associated with a grammar G denotes the set of all possible strings that could be generated by applying compositions of production rules from this grammar. A stochastic grammar associates a probability distribution with the grammar’s language. Given a string from a language, the string’s parse tree denotes a sequence of production rules, associated with the corresponding grammar, which generates the corresponding string. Image grammars use similar production rules in order to define, in a compact way, generative models of objects, thus facilitating the generalizability of object recognition systems which use such production rules for their object representations. An interesting observation from Table 3 is that very little work has been done on grammars and graph representations that simultaneously incorporate function, context, 3D (both in sensing and in object representations), texture and efficient training strategies. This is a good indication that within the context of graph models and hierarchical representations, the previously discussed semantic-gap problem of bridging low level and high level representations is still open. As will be discussed in Sec.3, a number of similar problems emerge within the active vision paradigm.
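As a toy illustration of these definitions, the grammar used in Fig. 21 below (V_T = {a, b}, V_N = {S}, rules r1: S → aS and r2: S → b) can be sampled directly; the rule probability used here is arbitrary and only serves to make the grammar stochastic.

```python
import random

# Sampling strings from the toy stochastic grammar of Fig. 21:
#   r1: S -> a S   (applied with probability p1)
#   r2: S -> b     (applied with probability 1 - p1, terminating the derivation)
# Each sample also records its parse: the sequence of production rules applied.

def sample(p1=0.6, max_depth=20):
    string, rules = [], []
    for _ in range(max_depth):
        if random.random() < p1:
            string.append("a"); rules.append("r1")      # S -> a S
        else:
            break
    string.append("b"); rules.append("r2")              # S -> b terminates
    return "".join(string), rules

random.seed(0)
print([sample() for _ in range(3)])   # three sampled (string, rule-sequence) pairs
```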

Zhu and Mumford [205] classify the related literature on image grammars into four streams. The earliest stream is attributed to Fu [207], who applied stochastic web grammars and plex grammars to simple object recognition tasks. These web and plex grammars are generalizations of the linguistic grammars discussed earlier, and are meant to extend standard grammars to 2-D images.

The second stream is related to Blum’s work on medial axes. Inspired by Blum’s argument [70, 208] that medial axes of shape outlines are a good and compact representation of shape, Leyton [209] developed a grammar for growing more complex shapes from simple objects. More recent work has expanded the scope of graph based algorithms using shock graphs [181], [210], [211], [212], [213], [214], [215]. Inspired by Blum’s concept of the medial axis, and given the significance that symmetry plays in parts-based recognition systems — symmetry in generalized cylinders/geons for example — algorithms have appeared for encoding the medial axis of an object in a graph structure and matching the graph structures of two objects in order to achieve recognition. Shock graphs encode the singularities that emerge during the evolution of the “grassfire” that defines the skeleton/medial-axis of the object. These are the “protrusion”, “neck”, “bend” and “seed” singularities. Thus, these shocks can be used to segment the medial axis into a tree-like structure.

Fig. 19 provides examples of the four kinds of shocks that are typically encountered in an object’s medial axis. By encoding those shocks into a tree-like structure (see Fig. 20), the recognition problem is reduced to that of graph/tree matching. As this corresponds to the largest subgraph isomorphism problem, a large portion of the research has focused on efficient techniques for matching two tree structures and making them robust in the presence of noise. An interesting matching algorithm is proposed by Siddiqi et al. [181]. The authors represent a shock tree using a 0-1 adjacency matrix. The authors show that finding the two shock subtrees whose adjacency representations have the same eigenvalue sum provides a good heuristic for finding the largest isomorphic subtrees, and thus achieving recognition via matching with a template object’s shock tree (see Fig. 20). In general, shock graphs provide a powerful indexing mechanism if segmentations/outlines of the desired objects are provided. Using such approaches on arbitrary images, however, requires a segmentation phase, which remains an unsolved problem and is also probably the most fundamental problem in computer vision.

The third stream proposed by Zhu and Mumford [205] refers to more recent work which was inspired by Grenander’s work on General Pattern Theory [97]. According to this paradigm, patterns in nature (including images) are formed by primitives called generators. The outputs of these generators are joined together using various graph-like criteria. Random diffeomorphisms applied to these patterns add another degree of generalization to the generated patterns. And-Or graphs lie within this stream (see Fig. 21). An And-Or graph uses conjunctions and disjunctions of simple patterns/generators to define a representation of all the possible deformations of the object of interest.

The fourth stream is similar to the previous stream, with the main difference being that an extremely sparse image coding model is used (employing, for example, simple image bases derived from Gabor filters parameterized by scale, orientation and contrast sensitivity) and that the related grammars can be viewed as stochastic context free grammars.

Figure 21: (left) A grammar with V_T = {a, b}, V_N = {S} and production rules r1: S → aS, r2: S → b, its universal And-Or tree, and a corresponding parse tree shown in shadow. A parse tree denotes a sequence of production rules, associated with the corresponding And-Or tree, which generates a corresponding string of terminals. (right) An And-Or tree showing how elements a, b, c could be bound into structure A in two alternative ways. Adapted from Zhu and Mumford [205].

A number of feedforward hierarchical recognition algorithms have been proposed over the years [216, 217, 104, 218, 105, 106, 183, 184, 219, 220, 221]. Such hierarchical architectures can be associated with the grammars discussed so far. One of the main characteristics of such hierarchical representations is that they often strive for biological plausibility. Typically, such algorithms define a multiscale feedforward hierarchy, where at the lowest level of the feedforward hierarchy edge and line extraction takes place. During a training phase, combinations of such features are discovered, forming a hierarchical template that is typically matched to an image during online object search.

For example, LeCun’s work on convolutional networks [217, 222, 182], and a number of its variants, have been successfully used in character recognition systems. Convolutional networks combine the use of local receptive fields, the use of shared weights, and spatial subsampling. The use of shared weights and subsampling adds a degree of shift invariance to the network, which is important since it is difficult to guarantee that the object of interest will always be centered in the input patch that is processed by the recognition algorithm. Because convolutional networks are purely feedforward they are also easily parallelizable, which has contributed to their popularity.
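A tiny numpy sketch of the two ingredients mentioned above — a convolution with a single shared kernel and 2×2 average subsampling. An actual convolutional network stacks many banks of learned kernels, nonlinearities and a trained classifier on top; this only illustrates weight sharing and subsampling.

```python
import numpy as np

# One shared 5x5 kernel slid over the whole image (weight sharing), followed by a
# ReLU-like nonlinearity and 2x2 average subsampling.

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)   # same weights everywhere
    return out

def subsample2x2(fmap):
    f = fmap[: fmap.shape[0] - fmap.shape[0] % 2, : fmap.shape[1] - fmap.shape[1] % 2]
    return 0.25 * (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2])

rng = np.random.default_rng(0)
img = rng.random((28, 28))
feature_map = subsample2x2(np.maximum(conv2d_valid(img, rng.random((5, 5))), 0))
print(feature_map.shape)   # (12, 12)
```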

Another goal of hierarchical architectures is to provide an efficient grammar for defining a set of re-usable object parts. These parts are typically meant to enable the efficient composition of multiple views of multiple objects. As a result, the problem of efficient feature grouping (see Fig.1) that removes any ambiguities due to environmental noise or poor imaging conditions keeps re-emerging in the literature on hierarchical recognition systems. The resulting ambiguities are one of the main reasons why hierarchical representations do not scale very well when dealing with thousands of object classes. For example, Ommer et al. [183] and Ommer and Buhmann [184] propose a characteristic methodology that attempts to deal with the complexity of real world image categorization by first performing a perceptual bottom-up grouping of edges, followed by a top-down recursive grouping of features.

It is important to point out that unrestricted object representation lengths and unrestricted representation class sizes can lead to significant problems when learning a new object’s representation [26]. Often with graph-like models, and especially in early research, their representational strength (number of nodes and edges) is hand-picked for the training dataset of interest, which can potentially lead to a significant bias when tested on new datasets.

Deng et al. [185, 186] present an algorithm for learning hierarchical relations of semantic attributes from labelled images. These relations are similar to predicates, and are arranged in the form of a hierarchical tree structure. The closeness of nodes in this hierarchical structure can be used to match similar images, which is in turn used for image classification purposes. The authors also introduce a hashing algorithm for sublinear image retrieval. Similarly, Bart et al. [187, 188] describe a graphical model for learning object taxonomies. Each node in this hierarchy represents information that is common to all the hierarchy’s paths that pass through that node, providing a compact representation of information. The authors present a Gibbs sampling based approach for learning the model parameters from training data and discuss potential applications of such taxonomies in recognition. More recent work [189] demonstrates the continuous evolution of research on graph-based object representations using massively large training data sets and clusters of thousands of CPUs.

2.11. Some More Object Localization Algorithms

We now focus on object localization algorithms which are robust, and which localize objects in an image efficiently. This provides an overview of the approaches attempted over the years for efficiently localizing objects in a static image. In subsequent sections we deal with the more complex case of active object localization and recognition, where we also have to physically move the sensor over large distances.


Chart 6: Summary of the 1994-2006 papers from Table 4. We notice that search efficiency was not consistently a primary concern in the localization literature, since many algorithms tended to use an inefficient sliding window approach to localize objects. Furthermore, Content-Based Image Retrieval systems mostly focused on the classification of individual images and not on the localization problem within individual images. Nevertheless, from Table 4 we see that search efficiency was identified as an important topic in a number of papers. CBIR systems were focused on using a diverse set of efficient indexing primitives. However, it is far from clear that they achieved the inference scaling properties desired, since in order to make such systems responsive and user friendly, accuracy was often sacrificed in favour of query efficiency. We also notice very little use of 3D in these systems.

As we will see, and as is evident from other localization algorithms that were previously discussed within other contexts (e.g., Lampert et al. [191]), object localization efficiency is closely related to improvements in the hypothesis generation and object verification modules of recognition systems (Fig.1). The breadth of algorithms tested for improving search efficiency is vast. They range from simple serial search mechanisms with winner-take-all, all the way to complex systems that integrate parallel search with probabilistic decision making and make use of function, context, and hierarchical object representations. Within this context, active vision plays an important role. As such, this section also serves as a precursor to the active search and recognition systems discussed in Sec.3.

As previously discussed [26], time, noise, as well as other cost constraints, can make the problem significantly more difficult. Furthermore, as discussed in [24, 26], searching for an object in an object class without knowledge of the exact target we are looking for (as the target appearance could vary drastically due to occlusions, for example) makes the problem intractable as the complexity of the object class increases. This issue becomes more evident in the case of the SLAM problem [26], where we want to simultaneously localize but also learn arbitrary new objects/features that might be encountered in the scene. This leads to the slightly counter-intuitive conclusion that the feature detection algorithm/sensor used must be characterized by neither too high nor too low a noise rate, since too low a noise rate makes the online learning of new objects prohibitively expensive. In the active object localization problem, where we typically have to move robotic platforms to do the search in 3D space under occlusions, and we know a priori the object we are searching for, any reduction in the total number of such mechanical movements would have a drastic effect on the search time and the commercial viability of our solution. Thus, a central tenet of our discussion in this section involves efficient algorithms for locating objects in an environment. In Chart 6 and Table 4 we present a comparison, along certain dimensions, of a number of the papers surveyed in Secs.2.11-2.12.

Avraham and Lindenbaum present a number of interesting papers that use a stochastic bottom-up attention model for visual search based on inner scene similarity [223], [224], [225]. Inner scene similarity is based on the hypothesis that search task efficiency depends on the similarities between scene objects and the similarities between target models and objects located in the scene. Assume, for example, that we are searching for people in a scene containing people and trees. An initial segmentation provides image regions containing either people or trees. An initial detector on a tree segment returns a “no” as an answer. We can, thus, place a lower priority on all the image segments having similar features as the rejected “no” segment, thus speeding up search.


These ideas were first put forward by Duncan and Humphreys [226], who rejected the parallel vs. serial search idea put forward by Treisman and Gelade, and suggested that a hierarchical bottom-up segmentation of the scene takes place first, with similar segments linked together. Suppression of one segment propagates to all its linked segments, potentially offering another explanation for the pop-out effect. Avraham and Lindenbaum demonstrate that their algorithm leads to an improvement compared to random search, and they also provide measures/grades indicating how easy it is to find an object. They define a metric based on the feature space distances between the various features/regions that are discovered in an image. Assuming that each such feature might correspond to a target, they present some lower and upper bounds on the number of regions that need to be queried before a target is discovered. Three search algorithms are presented and some bounds on their performance are derived:

1. FNN - Farthest Nearest Neighbour. Given the set of candidates’ feature vectors or segments x_1, ..., x_n, compute the distance of each such feature vector/segment to its nearest neighbour, and order the features based on descending distance. Query the object detector module until finding the object of interest. The idea is that the target object is usually different from most of the rest of the image, so it should be close to the top of the list.

2. FLNN - Farthest Labelled Nearest Neighbour. Given the set of candidates’ feature vectors/segments x_1, ..., x_n, randomly choose one of these feature vectors/segments and label it using the object detector. Then, for each unlabelled feature vector/segment, calculate its distance to the nearest labelled neighbour, choose the feature vector/segment with maximum such distance, query the object detector, and obtain its label. Repeat until an object is detected.

3. VSLE - Visual Search Using Linear Estimation. Define the covariance between two binary labels l(x_i), l(x_j) as cov(l(x_i), l(x_j)) = γ(d(x_i, x_j)) for some function γ and distance function d. Since the labels which denote the presence or absence of the target are binary, their expected values denote the probability that they take a value of 1. Given that we have estimated the labels l(x_1), ..., l(x_m) for m feature vectors/segments, we seek to obtain a linear estimate $\hat{l}_k = a_0 + \sum_{i=1}^{m} a_i l(x_i)$ which minimizes the mean square error $E((l(x_k) - \hat{l}_k)^2)$. It can be shown that an optimization of this expected value depends on the covariance of various pairs of labels. Given an image with n feature vectors/segments, calculate the covariance for each pair. Then, select the first candidate randomly or based on prior knowledge. At iteration m + 1, estimate $\hat{l}_k$ for all k ≥ m + 1 based on the known labels, and query the oracle about the candidate k for which $\hat{l}_k$ is maximum. If enough targets are found, abort; else repeat.

The authors perform a number of tests and demonstrate that especially the VSLE algorithm leads to an improvement in the number of detected objects and the speed at which they are detected, compared to a random search using the Viola-Jones detection algorithm [227].
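As an illustration of the simplest of these strategies, the following sketch orders candidate segments by the FNN criterion and queries a detector in that order. It is only a schematic rendering of the idea described above, under the assumption that each candidate is summarized by a feature vector; the `detector` callable is a hypothetical placeholder, not the authors' implementation.

```python
import numpy as np

def fnn_order(features):
    """Order candidate segments by Farthest Nearest Neighbour (FNN):
    segments whose nearest neighbour in feature space is far away are
    queried first, on the hypothesis that the target differs from most
    of the scene."""
    X = np.asarray(features, dtype=float)
    # Pairwise Euclidean distances; ignore the zero self-distance.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nearest = d.min(axis=1)
    return list(np.argsort(-nearest))        # descending nearest-neighbour distance

def fnn_search(features, detector):
    """Query a (hypothetical) object detector in FNN order and return the
    index of the first candidate it accepts, or None if nothing is found."""
    for idx in fnn_order(features):
        if detector(idx):
            return idx
    return None
```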

Draper et al. [228] present a Markov Decision Process (MDP) based approach for performing an online form of feature selection, where an MDP is used to determine which set of recognition modules should be applied for detecting a set of houses from aerial images. Paletta et al. have published a number of papers proposing the use of reinforcement learning techniques (based on Q-learning) for deciding the image sub-regions that the recognition module should attend to and extract features from in order to achieve recognition [229], [230]. Greindl et al. [231] present an attention based mechanism using a sequence of hierarchical classifiers for attending to image regions and recognizing objects from images. Bandera et al. [232] present work that is similar to Paletta’s work in that they also propose a Q-learning based approach for determining the fixation regions that would lead to the greatest decrease in entropy and the greatest object class discrimination. The features used in the experiments are simple corner based features and are encoded in a vector denoting the presence or absence of each feature in a scene. Recognition is achieved using a neural network trained on such example vectors. Note that in none of these papers do the authors use active cameras. Darrell [233] presents a formulation based on Partially Observable Markov Decision Processes (POMDP) with reinforcement learning to decide where to look in order to discriminate a target from distractor patterns. The authors apply their approach to the problem of gesture recognition.

Tagare et al. [234] present a maximum likelihood attention algorithm. The algorithm identifies object parts and features within each object part. They propose pairs of object parts and part features which most likely come from the object. In many ways the algorithm is an interpretation tree algorithm formulated using attention-like terminology.


Figure 22: The algorithm by Viola and Jones [227, 235, 236]. On the left some of the two, three and four-rectangle features/kernels used are shown. Tens of thousands of these features are used as simple classifiers in an Adaboost framework, in order to define a cascade of classifiers which progressively prune image regions which do not contain the object of interest (see the right subfigure).

Probability densities are defined for the chance of occlusion (using a Poisson distribution), and a maximum likelihood estimation is performed to determine the most probable part-feature pair, the second most probable part-feature pair, etc. Features used in their experiments include corners and edges. The algorithm ends up evaluating only about 2% of all part-feature pairs on the test images used by the authors.

Torralba et al. [237] present a method for multiclass object detection that learns which features are common in various disparate object classes. This allows the multiclass object detector to share features across classes. Thus, the total number of features needed to detect objects scales logarithmically with the number of classes. Many object localization algorithms train a binary classifier for each object the algorithm is attempting to localize, and slide a window across the image in order to detect the object, if it exists. As the authors argue, the use of shared features improves recognition performance and decreases the number of false positives. The idea is that for each object class c, a strong classifier H_c is defined which is a summation of a number of weak classifiers. Each of the weak classifiers is trained to optimally detect a subset of the C object classes. Since in practice there are 2^C such subsets, the authors suggest some heuristics for improving the complexity. Linear combinations of these weak classifiers under a boosting framework are acquired, which provide the strong classifiers H_c. The authors’ working hypothesis is that by fitting each weak classifier on numerous classes simultaneously, they are effectively training the classifiers to share features, thus improving recognition performance and requiring a smaller number of features. The authors present some results demonstrating that joint boosting offers some improvements compared to independent boosting under a ROC curve of false positives vs. detection rate. They also present some results on multiview recognition, where they simply use various views of each object class to train the classifiers. Similarly to the Viola and Jones algorithm, the authors demonstrate a characteristic of the algorithm which makes it suitable for localizing objects in scenes using the shifted template/window approach, namely its small number of false positives.

Opelt et al. [238] present a boundary fragment model approach for object localization in images. Edges are detected using a Canny edge detector, from pre-segmented image regions containing the object of interest and containing a manually annotated centroid of each object. A brute force approach searches through the boundary fragments and sub-fragments, and a matching score of the fragment with a validation set is calculated. The matching score is based on the Chamfer distance of the fragments from the fragments located in each image of the validation set. The matching score also depends on how close the centroids of each object are to each other; each fragment is associated with its centroid. The authors define a set of weak detectors which typically learn pairs or triples of boundary fragments that lead to optimal classification. Those weak detectors are joined into a strong detector which recognizes the desired object from an image. Overall, the use of boundary fragments makes this algorithm quite robust under illumination changes, and it should be quite robust for solving simple exemplar-like detection tasks. Its high complexity is a significant drawback of the algorithm though. Simple approaches to multi-view object localization are proposed. Their work is further expanded upon in [196].
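The Chamfer score mentioned above can be computed cheaply from a distance transform of the edge map. The sketch below, assuming SciPy is available, is only an illustration of this kind of matching score and not the authors' implementation; the fragment coordinates are hypothetical inputs.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(fragment_points, edge_map):
    """Average distance from each point of a boundary fragment to the
    nearest edge pixel in the image; lower is a better match.
    `fragment_points` is an (N, 2) array of (row, col) coordinates and
    `edge_map` a binary image (1 = edge pixel)."""
    pts = np.asarray(fragment_points, dtype=int)
    # Distance of every pixel to its nearest edge pixel.
    dt = distance_transform_edt(edge_map == 0)
    return float(dt[pts[:, 0], pts[:, 1]].mean())
```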

Amit and Geman [240] present a computational model for localizing object instances.


Figure 23: In [239] a vision-based computer-controlled wheelchair equipped with a 6DOF arm actively searches an indoor environment to localize a door, approaches the door, pushes down the door handle, opens the door, and enters the next room.

Local edge fragments are grouped based on known geometrical relationships of the desired object that were learnt during the training phase. The authors note that even though their search model is not meant to be biologically plausible, it exhibits some of the characteristics of neurons in the Inferior Temporal Cortex, such as scale and translation invariance.

A local feature based method for object localization is also presented by Piater [241]. Steerable filters are used to extract features at various scales. These features are used to extract blobs, corners and edges from images. These provide salient image features. As Amit and Geman [240] did, Piater clusters these features into compound features based on their geometric relations. Piater’s thesis is interesting in that it proposes an online algorithm for learning new discriminative features and retraining the classifier. The algorithm is composed mainly of a Bayesian network/classifier that uses features to achieve recognition, and a feature learning system that uses the Bayesian classifier to train on training images and decide when a new discriminative feature needs to be added to the Bayesian network.

Viola and Jones [227], [235], [236] present a robust approach for object localization (see Fig. 22). Their approach is ideally suited for localizing objects due to the low number of false negatives it produces. Search is done by the simple method of shifting a template across space and scale, making it somewhat inefficient in its search. A number of Haar-like features are extracted from the image. In the original formulation by Viola and Jones, Haar-like templates provide features similar to first and second order derivatives acquired from a number of orientations. By extracting from each image location p a number of such features at different scales and neighbourhoods close to p, it is easy to end up with thousands of features for each image pixel p. The authors propose using a cascade of classifiers, where each classifier is trained using Adaboost to minimize the number of false negatives by putting a lower weight of importance on features which tend to produce such false negatives. However, the method is only suited for detecting simple objects characterized by a small number of salient features, such as faces and door handles [239]. See Fig. 23 for an example of door handle localization using the Viola and Jones algorithm [227]. The algorithm would have problems detecting highly textured objects. Furthermore, the method’s training phase is extremely slow, and care must be taken during the training phase as it is easy to end up with a classifier producing many false positives.
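The efficiency of the approach rests on the integral image, which lets any rectangle sum (and hence any Haar-like feature) be evaluated in constant time, and on the cascade, which rejects most windows after a few cheap stages. The sketch below illustrates these two ingredients only; the single fixed stage threshold is a simplification (the actual algorithm learns a threshold per stage), and the weak classifiers and their boosting weights are assumed to be supplied by a training procedure not shown here.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[r, c] = sum of img[:r, :c].  Any rectangle sum
    then costs four lookups, which makes evaluating tens of thousands of
    Haar-like features per window affordable."""
    return np.pad(np.asarray(img, dtype=float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle with top-left corner (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """A two-rectangle Haar-like feature: difference between the sums of two
    horizontally adjacent rectangles (roughly a first-derivative filter)."""
    return rect_sum(ii, r, c, h, w) - rect_sum(ii, r, c + w, h, w)

def cascade_detect(window, stages, stage_threshold=0.0):
    """Evaluate a cascade on one window: each stage is a list of
    (weak_classifier, alpha) pairs produced by boosting; the window is
    rejected as soon as one stage's weighted vote falls below the threshold."""
    ii = integral_image(window)
    for stage in stages:
        score = sum(alpha * weak(ii) for weak, alpha in stage)
        if score < stage_threshold:
            return False        # pruned early: most windows exit here
    return True
```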

Fleuret and Geman [242] present a coarse-to-fine approach to object localization. The authors measure their algorithm’s performance in terms of the number of false positives and the amount of on-line computation needed to achieve a small false negative rate. The approach is tested on the face detection problem. The object class is represented by a large disjunction of conjunctions. Each conjunction represents salient object features under strictly known conditions of lighting, scale, location and orientation, and the disjunctions account for a large number of variations of these conditions. The authors also present an interesting approach for measuring the efficiency of object localization tasks based on the number of branches followed in a decision tree, where each branch might correspond to a search in a different image location, or a different scale. They use such ideas to argue in support of a coarse-to-fine approach for object localization. In other words, when searching for a certain object, we should first search across all scales and, at the first failure to detect a necessary feature in one of the scales, search in a different image location. The authors present a rigorous proof of the optimality of such a search under a simpler model where the non-existence of the target is declared upon the first negative feature discovered. They also indicate that a coarse-to-fine approach was proven optimal under a number of simulations they performed, even though the proof for the general case still eludes them.
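The search strategy they advocate can be caricatured as follows: run the cheapest, coarsest tests first at every candidate location and abandon a location at the first failure, so that only a small fraction of locations ever pays for the finer, more expensive tests. The snippet below is only a schematic rendering of this idea, not Fleuret and Geman's algorithm; `locations` and `tests` are hypothetical placeholders.

```python
def coarse_to_fine_search(locations, tests):
    """Illustrative coarse-to-fine scan: `tests` is ordered from coarse
    (cheap, run first) to fine (expensive).  A location is abandoned at the
    first failed test, so most of the image only pays for the coarsest
    checks; only locations passing every level are reported."""
    detections = []
    for loc in locations:
        if all(test(loc) for test in tests):   # short-circuits on failure
            detections.append(loc)
    return detections
```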

A number of Monte-Carlo approaches for object localization have also been attempted in an effort to escape the inefficiency of exhaustive search across scale-space for an object. Two such approaches, based on particle filters, were proposed by Sullivan et al. [243] and Isard [202] for performing inference.


Papers (1994-2006) | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Avraham and Lindenbaum [224] | ** | *** | N/A | N/A | N/A | *** | * | *
Draper et al. [228] | * | ** | ** | * | *** | *** | * | *
Paletta et al. [229] | * | *** | ** | * | ** | ** | * | *
Greindl et al. [231] | * | *** | ** | * | ** | ** | * | *
Bandera et al. [232] | * | *** | ** | N/A | N/A | *** | * | *
Darrell [233] | * | ** | ** | ** | ** | ** | * | *
Tagare et al. [234] | * | *** | ** | ** | ** | * | * | *
Torralba et al. [237] | *** | * | * | * | **** | * | * | **
Opelt et al. [238, 196] | *** | * | * | ** | *** | * | * | *
Amit and Geman [240] | ** | ** | *** | *** | ** | * | * | *
Piater [241] | ** | ** | *** | ** | *** | ** | * | *
Viola et al. [236] | *** | * | * | * | *** | * | * | *
Fleuret and Geman [242] | ** | *** | ** | *** | ** | * | * | *
Flickner et al. [244] | ** | N/A | ** | ** | *** | * | * | **
Gupta and Jain [245] | ** | N/A | ** | ** | *** | * | * | **
Mukherjea et al. [246] | ** | N/A | ** | ** | *** | * | * | **
Pentland et al. [247] | ** | N/A | ** | ** | *** | * | * | **
Smith and Chang [248] | ** | N/A | ** | ** | *** | * | * | **
Wang et al. [249] | ** | N/A | ** | ** | *** | * | * | **
Ma and Manjunath [250] | ** | N/A | ** | ** | *** | * | * | **
Laaksonen et al. [251] | ** | ** | ** | ** | *** | * | * | **

Table 4: Comparing some of the more distinct algorithms of Secs.2.11-2.12 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*,**,***,****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influence the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.


An overall observation is that many papers described as localization algorithms simply follow the sliding window approach due to their low false-positive or low false-negative rate. Few of the original algorithms attempted to focus on salient regions based on prior knowledge of the object they were searching for. Cognizant of these problems, more recent work has focused on using saliency algorithms in conjunction with task-directed biases to speed up the search process [252]. As bottom-up algorithms do not use any task-directed biases, the benefits they offer during visual search only become evident when dealing with low-complexity scenes where the objects of interest pop out easily with respect to the background. While there has been some effort in incorporating top-down biases in such systems, it is not clear whether they are capable of offering benefits for foreground and background regions of arbitrary complexity. As Avraham and Lindenbaum showed, proper attentional mechanisms can lead to better performance, so more focus on the problem is worthwhile. Notice that usually no cost/time constraints are included in the formulation of such papers. In subsequent sections we discuss a number of papers demonstrating the effect that such mechanisms can have on search efficiency. Thus, within the context of the recognition framework of Fig.1, we notice a gradual shift of the effort expended on the hypothesis generation and object verification phase. The shift is towards a preference for the integration of ever more powerful and complex inference algorithms that improve search efficiency, moving us beyond the impediments of the sliding window approach to recognition.


2.12. Content-Based Image Retrieval

Every day around 2.5 quintillion bytes of data is created. It is estimated that 90% of the data in the world was created over the last two years [253]. A significant portion of this data consists of digital pictures and videos. Online video and multimedia content has experienced annual double digit growth for the last few years [254], precipitating the need for steady improvements in the automated mining of video and pictorial data for useful information. Numerous content based image retrieval (CBIR) systems have been proposed for addressing this problem. Proposed CBIR solutions typically lie at the intersection of computer vision, databases, information retrieval [168], HCI and visualization/computer-graphics. Arguably, the first workshop on the topic took place in Florence in 1979 [168, 255]. This was followed in 1992 by a workshop organized by the US National Science Foundation [168, 256] where the need for interactive image understanding was emphasized. The need to include a non-trivial interaction component is what differentiates research on CBIR systems from the more classical object recognition algorithms previously discussed. See [168, 257, 258, 259] for good surveys on the topic. Some early influential example systems in the commercial domain [258] include IBM’s QBIC [244], VIRAGE [245], and the NEC AMORE system [246], as well as the MIT Photobook [247], the Columbia VisualSEEK/WebSEEK [248], Stanford’s WBIIS system [249] and the UCSB NeTra system [250] from academia.

CBIR has been a vibrant topic of research. Huijsmans and Sebe [260] note that image search can be split into three main categories: (i) search by association, where an iterative process is used to refine the browsed images, (ii) aimed search, where the user explicitly specifies the image he wishes to search for, and (iii) category search, where a more loosely defined semantic class is searched for (which could be defined by a text string, an image example, or a combination of both).

Similarly, Datta et al. [258] categorize CBIR systems based on four axes: User Intent, Data Scope, Query Modalities and Visualization. User Intent is characterized by the clarity of the user’s intent: is the user browsing for pictures with no clear goal (in which case the user is a Browser), with a moderate goal where the user slowly becomes clearer about his end-goal (the user is a Surfer), or does he have from the very beginning a very clear understanding of what he is looking for (the user is a Searcher)? Identifying the user type and adjusting the user interface accordingly could vastly improve the user experience and potentially influence the commercial viability of a system, exemplifying the importance of the HCI component in the design of a CBIR system.

Clarifying the scope of the image and video data can also be very important, since this can influence the system design in terms of how reliable the image search has to be, how fast and responsive the underlying hardware architecture has to be, as well as what type of user interface to implement. These dimensions could be particularly important in the case of social media websites, for example, where users tend to share photo and video albums using a variety of data acquisition modalities (e.g., smartphones). Datta et al. [258] classify the image and video data based on whether it is intended for: (i) a personal collection expected to be stored locally, to be of relatively small size and to be accessible only to its owner, (ii) a domain specific collection such as medical images or images and videos acquired from a UAV, (iii) an enterprise collection for pictures and videos available in an intranet and potentially not stored in a single central location, (iv) archives for images and videos of historical interest which could be distributed in multiple disk arrays, accessible only via the internet and requiring different levels of security/access controls for different users, or (v) Web-based archives that are available to everyone and should be able to support non-trivial user traffic volumes, store vast amounts of data, and search semi-structured and non-homogeneous data.

Query modalities for CBIR systems can rely on keywords, free-text (consisting of sentences, questions, phrases, etc.), images (where the user requests that similar images, or images in the same semantic category as the query image, be retrieved), graphics (where the user draws the query image/shape), as well as composite approaches based on combinations of the aforementioned modalities [258].

Finally, visualization is another aspect of a CBIR system that can influence its commercial success [258]. Relevance-ordered results are presented based on some order of importance of the results, and this is the approach adopted by most image search engines. Time-ordered results are presented in chronological order and are commonly used in social media websites such as Facebook. A clustered presentation of the results can also provide an intuitive way of browsing groups of images. A hierarchical approach to visualizing the data could be implemented through the use of metadata associated with the images and videos. As noted in Datta et al. [258], a hierarchical representation could be useful for educational purposes. Finally, combinations of the above visualization systems could be a useful feature when designing personalized systems.


CBIR systems need to be efficient and scalable, resulting in a preference towards the use of storage-wise scalable feature extraction algorithms. However, such feature extraction algorithms must also be powerful in terms of their indexing capabilities. The so-called semantic gap, which we discussed in some detail in Secs.1, 2.1, is a topic which re-emerges in the CBIR literature. The ability to integrate feature primitives which are also powerful indexing mechanisms is typically moderated by the fact that powerful indexing primitives tend to also be quite expensive computationally.

Scalable features are either color-based, texture-based or shape-based. See Veltkamp and Tanase [259] for a good overview. Color features can rely on the use of an image’s dominant color, a region’s histogram, a color coherence vector, color moments, correlation histograms, or a local image histogram. Common texture features used include edge statistics, local binary patterns, random field decomposition, atomic texture features and Wavelet, Gabor or Fourier based features. Common shape features include the use of bounding boxes such as ellipses, curvature scale space, elastic models, Fourier descriptors, template matching and edge direction histograms.
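As a concrete example of a cheap, storage-wise scalable colour feature, the sketch below computes a coarsely quantised joint RGB histogram and a simple distance between two such histograms. This is a generic illustration, assuming 8-bit RGB input; it is not the feature set of any particular system surveyed here, and the bin count and L1 distance are arbitrary choices.

```python
import numpy as np

def color_histogram(rgb_image, bins_per_channel=8):
    """Joint RGB histogram with a few bins per channel, L1-normalised so
    that images of different sizes are directly comparable."""
    img = np.asarray(rgb_image).reshape(-1, 3)
    # Quantise each 8-bit channel into bins_per_channel levels.
    q = (img.astype(np.float64) / 256.0 * bins_per_channel).astype(int)
    q = np.clip(q, 0, bins_per_channel - 1)
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    """Simple L1 distance between two normalised histograms; many early
    systems used variants of Euclidean or histogram-intersection scores."""
    return float(np.abs(h1 - h2).sum())
```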

Thus, within the CBIR context, we see that the general recognition framework of Fig.1 forms a submodule of a more complex system lying at the confluence of HCI (user interfaces, user intent prediction, visualization), database systems, and object recognition.

Flickner et al. developed the QBIC system, which uses color, texture and shape features [244, 259]. The RGB, YIQ, Lab and Munsell color spaces are used to extract whole image color representations or average color vectors. The shape features rely on algebraic moment invariants (corresponding to eigenvalues of various constructed matrices), major axis orientation, shape area and eccentricity. Tamura’s texture features were the inspiration for the coarseness, contrast and directionality features used [261]. Querying is based on example images, sketches drawn by the user, and the use of various query colors and texture patterns. Matching relies on variants of the Euclidean distance between the extracted color, shape and texture features. Relevance feedback enables the user to select retrieved images and use them as seeds in subsequent queries.

The Photobook from MIT’s Media Lab [262] is a system that relies on the extraction of faces, 2D shapes and texture images. The face and 2D shape features rely on the eigenvectors of a covariance matrix depending on pixel intensities and various feature points defining the object’s shape [113]. The texture description of the object depends on periodicity, directionality and randomness. For each image category a few characteristic prototypes for the image category are selected. For each database image the average distance to these image prototypes is calculated. The distance of the query image to these averages is used during query time in order to match the query image to a category.

The PicSOM system [251, 263, 264, 265] will be discussed in more detail towards the end of the paper, as these features have also been used successfully in the annual PASCAL competitions. Briefly, a number of color features are extracted from the RGB channels and the YIQ channels. Also, the image edges (extracted with a Sobel operator), in conjunction with the low-passed Fourier spectrum, provide another 128-dimensional vector which is useful for recognition purposes. A Self-Organizing Map (SOM) is used to match the images, where the distance between SOM units corresponds to the Euclidean distance between the above described feature vectors.
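At query time a trained SOM can be used for matching as described above: each image's feature vector is mapped to its best-matching unit and images falling on nearby units are treated as similar. The sketch below shows only that lookup step under the assumption that the SOM weights have already been learned offline; the array names and sizes are illustrative, not the PicSOM implementation.

```python
import numpy as np

def best_matching_unit(som_weights, feature_vector):
    """Return the grid index of the SOM unit whose weight vector is closest
    (Euclidean distance) to the image's feature vector.  Images mapping to
    nearby units are treated as visually similar."""
    d = np.linalg.norm(som_weights - feature_vector, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

# som_weights: a (rows, cols, dim) grid of unit weight vectors learned
# offline; feature_vector: e.g. a 128-dimensional edge/Fourier descriptor
# as mentioned above (both hypothetical placeholders here).
```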

From the CBIR systems compared in Table 4, we notice that it is difficult to compare and discriminate CBIR systems along the typical dimensions used to compare recognition systems. This is because few performance metrics are typically disclosed for such systems. Early published work relied on fairly similar features (mostly color, shape, texture, and sometimes text based) and made little to no use of function, context, or 3D object representations. More recent work on image classification (as reviewed in Secs.2.7, 2.9 for example) relies to a greater extent on the use of context. The main differentiating factor amongst CBIR systems lies in the user interface (how users enter their queries, the type of queries, the use of relevance feedback), how data is visualized, and query responsiveness.

The discussion on classical approaches to object recognition has provided the reader with an overview of the types of algorithms that could be incorporated in a CBIR system. Practical CBIR systems have to make certain compromises between the indexing power of the feature extraction algorithms, their generality, and their computational requirements. The practical success of a CBIR system will also ultimately rely on its user interface, the power of its relevance feedback mechanism, the system’s ability to predict what the user is searching for, as well as on the representational power of the system’s core recognition algorithms.


3. Active and Dynamic Vision

In the introduction we overviewed some of the advantages and disadvantages of the active vision framework. The human visual system has two main characteristics: the eyes can move, and visual sensitivity is highly heterogeneous across visual space [33]. Curiously, these characteristics are largely ignored by the vision community.

The human eyes exhibit four types of behaviours: saccades, fixation, smooth pursuit, and vergence. Saccades are ballistic movements associated with visual search. Fixation is partially associated with recognition tasks which do not require overt attention. Smooth pursuit is associated with tracking tasks, and vergence is associated with vision tasks which change the relative directions of the optical axes. How do these behaviours fit within the active vision framework in computer vision? As discussed in Sec.2.7, it is believed that during early childhood development, the association between the sight of an object and its function is primed by manipulation, randomly at first, and then in a more and more refined way. This hints that there exists a strong association between active vision and learning. Humans are excellent in recognizing and categorizing objects even from static images 2. It can thus be argued that active vision research is at least as important for learning object representations as it is for online recognition tasks.

Findlay and Gilchrist [33] make a compelling argument in support of more research in the active approach to human vision:

1. Vision is a difficult problem consisting of many building blocks that can be characterized in isolation. Eye movements are one such building block.

2. Since visual sensitivity is the highest in the fovea, in general, eye movements are needed for recognizing small stimuli.

3. During a fixation, a number of things happen concurrently: the visual information around the fixation is analyzed, and visual information away from the current fixation is analyzed to help select the next saccade target. The exact processes involved in this are still largely unknown.

Findlay and Gilchrist [33] also pose a number of questions, in order to demonstrate that numerous basic problems in vision still remain open for research.

1. What visual information determines the target of the next eye movement?
2. What visual information determines when eyes move?
3. What information is combined across eye movements to form a stable representation of the environment?

As discussed earlier [29], a brute force approach to object localization subject to a cost constraint is often intractable as the search space size increases. Furthermore, the human brain would have to be some hundreds of thousands of times larger than it currently is if visual sensitivity across the visual space were the same as that in the fovea [29]. Thus, active and attentive approaches to the problem are usually proposed as a means of addressing these constraints.

We will show in this section that, within the context of the general framework for object recognition that was illustrated in Fig.1, previous work on active object recognition systems has conclusively demonstrated that active vision systems are capable of leading to significant improvements in both the learning and inference phases of object recognition. This includes improvements in the robustness of all the components of the feature-extraction → feature-grouping → object-hypothesis → object-verification → object-recognition pipeline.

Some of the problems inherent in single view object recognition include [266]:

1. The impossibility of inverting projection and the fragility of 3D inference. It is, in general, impossible to recover a three dimensional world from its two dimensional projection on an image, unless we make restrictive assumptions about the world.

2. Occlusion. Features necessary for recognition might be self-occluded or occluded by other objects.

3. Detectability. Features necessary for recognition might be missing due to low image contrast, illumination conditions and incorrect camera placement [73].

4. View Degeneracies. As discussed in [49], view degeneracies that are caused by accidental alignments can easily lead to wrong feature detection and bad model parameterizations.

2 See Biederman’s work at http://geon.usc.edu/~biederman/ObjectRSVP.mov


Figure 24: Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50, 268]). The block diagram (not reproduced here) links the real world in 3-D space and time to an internally represented world in 3-D space and time: video technology and feature detection provide measured features; feature assignment, tracking and state recognition use perspective mapping and many local Jacobian matrices; unassigned features drive the recognition of new objects and model adaptation (learning, supported by experience and background knowledge); and the internal representation produces predicted features, extended predictions, situation assessment, short-term predictions and behaviour decisions that control processes in the real world, with intelligent control of image processing and a discretization/3-D to 2-D mapping closing the loop.

It is straightforward to see how the above problems can adversely influence the components of a typical object recognition system shown in Fig.1. Various attempts have been made to address these problems. The various 3D active object recognition systems that have been proposed so far in the literature can be compared based on the following four main characteristics [267]:

1. Nature of the Next View Planning Strategy. Often the features characterizing two views of two distinct objects are identical, making single view recognition very difficult. A common goal of many active recognition strategies is to plan camera movements and adjust the camera’s intrinsic parameters in order to obtain different views of the object that will enable the system to escape from the single view ambiguities. While classical research on active vision from the field of psychology has largely focused on ’eyes and head’ movements, the next-view planning literature in computer vision and robotics assumes more degrees of freedom, since there are no constraints on how the scene can be sensed or what types of actuators the robotic platform can have.

2. Uncertainty Handling Capability of the Hypothesis Generation Mechanism. One can distinguish between Bayesian based and non-Bayesian based approaches to the hypothesis generation problem and the handling of uncertainty in inference.

3. Efficient Representation of Domain Knowledge. The efficiency of the mechanism used to represent domain knowledge and form hypotheses is another feature distinguishing the recognition algorithms. This domain knowledge could emerge in the form of common features (such as edges, moments, etc.) as well as other features that are appropriate for using context or an object’s function to perform recognition.

4. Speed and Efficiency of Algorithms for Both Hypothesis Generation and Next View Planning. Complexity issues arise, for example, in terms of the reasoning and next view planning algorithm that is used, but also in terms of other components of the recognition algorithm. The complexity of those subcomponents can play a decisive role as to whether we will have an active object recognition algorithm that performs in real time, even if we use a highly efficient representation scheme for the domain knowledge from point 3.

As indicated in the introduction, the dynamic vision paradigm subsumes the active vision paradigm, and is more focused on dynamic scenes where vision algorithms (such as recognition algorithms) are applied concurrently to the actions being executed.


Within this context, a significant topic of research in dynamic vision systems is the incorporation of predictions of future developments and possibilities [50, 268]. Historically, dynamic vision systems have focused on the construction of vision systems that are reliable in indoor and outdoor environments. Within this context, dynamic vision systems are also more tightly coupled to the research interests of the robotics community, as compared to classical computer vision research.

Historically, dynamic vision research emerged due to the need to integrate recursive estimation algorithms (e.g., Kalman filters) with spatio-temporal models of objects observed from moving platforms. As pointed out by Dickmanns [50], applying vision algorithms concurrently to the actions performed requires the following (also see Fig. 24): (i) The computation of the expected visual appearance from fast moving image sequences and the representation of models for motion in 3-D space and time. (ii) Taking into account the time delays of the different sensor modalities in order to synchronize the image interpretation. (iii) The ability to robustly fuse different elements of perception (such as inertial information, visual information and odometry information) whose strengths and weaknesses might complement each other in different situations. For example, visual feedback is better for addressing the long-term stability drift-problems which might emerge from inertial signals, while inertial signals are better for short-term stability when implementing ego-motion and gaze stabilization algorithms. (iv) Incorporating a knowledge base of manoeuvre elements for helping with situational assessment. (v) Incorporating a knowledge base of behavioural capabilities for various scene objects, so that the objects’ behaviour and identity can be identified more easily from small temporal action elements. (vi) Taking into consideration the interdependence of the perceptual and behavioural capabilities and actions across the system’s various levels, all the way down to the actual hardware components.

We see that dynamic vision systems incorporate what is often also referred to as contextual information, thus taking a much broader and holistic approach to the vision problem. A significant insight of Dickmanns’ spatio-temporal approach to vision was that modelling objects and motion processes over time in 3-D space (as compared to modelling them directly in the image plane), followed by the perspective projection of those models into the image plane, led to drastic improvements in the calculation of the respective Jacobian matrices used in the recursive estimation processes, and thus became a necessary component of a dynamic vision system. This approach led to the creation of robust vision systems that were far more advanced than what had been considered the state of the art up until then. Examples of such systems include pole balancing using an electro-cart [269], the first high-speed road vehicle guidance by vision on a highway [50] (which includes modules for road recognition [270, 271, 272, 273], lane recognition, road curvature estimation, and lane switching [274, 50], obstacle detection and avoidance [275], recognition of vehicles and humans [276], and autonomous off-road driving [50]), as well as aircraft and helicopters that use vision to land autonomously [277, 278]. Within the context of the recognition pipeline shown in Fig.1, the work by Dickmanns improved the reliability of the measured features, of the predicted features, and of the object hypotheses and their subsequent grouping, when attempting to extract these features under egomotion. These improvements led to innovations in vision that were significant and surprising for their time, demonstrating for example the first self-driving vision-based vehicle.
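To make the point about the Jacobians concrete, the following minimal sketch shows the pinhole projection of a 3-D point and the analytic 2x3 Jacobian of that projection, which is the quantity a recursive estimator (e.g., an extended Kalman filter) uses to relate corrections of the 3-D state to image-plane measurement residuals. This is a generic textbook illustration under a simple pinhole model, not Dickmanns' implementation; the variable names are illustrative.

```python
import numpy as np

def project(point_3d, focal_length):
    """Pinhole projection of a 3-D point (X, Y, Z), Z > 0, onto the image
    plane: u = f*X/Z, v = f*Y/Z."""
    X, Y, Z = point_3d
    return np.array([focal_length * X / Z, focal_length * Y / Z])

def projection_jacobian(point_3d, focal_length):
    """Analytic Jacobian d(u, v)/d(X, Y, Z) of the perspective mapping.
    Because the object and its motion are modelled in 3-D and only then
    projected, this 2x3 matrix is all a recursive estimator needs in order
    to relate state corrections to image-plane measurement residuals."""
    X, Y, Z = point_3d
    f = focal_length
    return np.array([[f / Z, 0.0,   -f * X / Z**2],
                     [0.0,   f / Z, -f * Y / Z**2]])
```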

In Sec. 1 we discussed some of the execution costs associated with an active vision system. These problems (such as the problem of determining correspondences under an imperfect stereo depth extraction algorithm and the problem of addressing dead-reckoning errors) are further exacerbated in dynamic vision systems, where the actions are executed concurrently to the vision algorithms. This is one major reason why the related problems are more prominently identified and addressed in the literature on dynamic vision systems, since addressing these problems usually becomes a necessary component of any dynamic vision system.

At this point we need to make a small digression and discuss the difference between passive sensors, active sensors, active vision and passive vision. While passive and active vision refer to the use (or lack thereof) of intelligent control strategies applied to the data acquisition process, an active sensor refers to a sensor which provides its own energy for emitting radiation, which in turn is used to sense the scene. In practice, active sensors are meant to complement classical passive sensors such as light sensitive cameras. The Kinect [295] is a popular example of a sensor that combines a passive RGB sensor and an active sensor (an infrared laser combined with a monochrome CMOS camera for interpreting the active sensor data and extracting depth).


Chart 7: Summary of the 1989-2009 papers in Table 5 on active object detection. By definition, search efficiency is not the primary concern in these systems, since by assumption the object is always in the sensor’s field of view. However, inference scalability constitutes a significant component of such systems. We notice very little use of function and context in these systems. Furthermore, training such systems is often non-trivial.

Figure 25: A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266] actively recognizes an origami object.

One could classify vision systems into those which have access to depth information (3D) and those that do not. One could argue that the use of active sensors for extracting depth information is not essential in the object recognition problem, since the human eyes are passive sensors and stereo depth information is not an essential cue for the visual cortex. In practice, however, active sensors are often superior for extracting depth under variable illumination. Furthermore, depth is a useful cue in the segmentation and object recognition process. One of the earliest active recognition systems [286] made use of laser range-finders. Within the context of more recent work, the success of Kinect-based systems [296, 297, 298, 299] demonstrates how combined active and passive sensing systems improve recognition. For example, the work by Tang et al. [296] achieved top ranked performance in a related recognition challenge by leveraging the ability of the Kinect to provide accurate depth information in order to build reliable 3D object models. Within the context of the recognition pipeline shown in Fig.1, active sensors enable us to better register the scene features with the scene depth. This enables the creation of higher fidelity object models, which in turn are useful in improving the feature grouping phase (e.g., determining the features which lie at similar depths) as well as the object hypothesis and recognition phases (by making 3D object model matching more reliable).
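A minimal sketch of the kind of depth-assisted grouping mentioned above is given below: features whose registered depths are close are grouped together before hypothesis generation. The gap threshold, the assumption that each feature already has a registered depth, and the function name are all illustrative choices, not the method of any particular system surveyed here.

```python
import numpy as np

def group_features_by_depth(feature_depths, gap=0.05):
    """Group feature indices whose registered depths (e.g., in metres) are
    similar: sort by depth and start a new group whenever consecutive depths
    differ by more than `gap`.  A crude illustration of using RGB-D
    registration to aid the feature-grouping stage; assumes at least one
    feature is given."""
    depths = np.asarray(feature_depths, dtype=float)
    order = np.argsort(depths)
    groups, current = [], [order[0]]
    for prev, nxt in zip(order[:-1], order[1:]):
        if depths[nxt] - depths[prev] > gap:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups
```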

3.1. Active Object Detection Literature Survey

With the growing popularity in the 1990s of machine learning and Bayesian based approaches for solving computer vision problems, active vision approaches lost their popularity. The number of related publications decreased significantly between the late 1990s and the following decade.


Papers (1989-2009) | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Wilkes and Tsotsos [266] | * | ** | * | * | * | * | * | *
Callari and Ferrie [279] | * | ** | * | * | * | * | *** | *
Dickinson et al. [280] | ** | ** | * | *** | *** | * | * | *
Schiele and Crowley [281] | *** | ** | ** | ** | ** | * | * | **
Borotschnig et al. [282] | *** | ** | ** | *** | ** | * | * | **
Paletta and Prantl [283] | *** | ** | ** | *** | ** | * | * | **
Roy et al. [284] | ** | ** | ** | ** | *** | * | ** | *
Andreopoulos and Tsotsos [239] | ** | ** | * | * | * | * | *** | *
Roy and Kulkarni [285] | ** | ** | ** | *** | *** | * | * | *
Hutchinson and Kak [286] | * | ** | * | * | *** | * | *** | *
Gremban and Ikeuchi [287] | ** | ** | * | * | * | * | ** | *
Herbin [288] | * | * | * | * | * | * | * | *
Kovacic et al. [289] | ** | ** | ** | ** | * | * | ** | *
Denzler and Brown [290] | ** | *** | ** | ** | * | * | * | *
Laporte and Arbel [291] | ** | *** | ** | ** | * | * | * | *
Mishra and Aloimonos [292] | *** | N/A | N/A | N/A | *** | ** | *** | ***
Mishra et al. [293] | *** | N/A | N/A | N/A | *** | ** | *** | ***
Zhou et al. [294] | *** | N/A | N/A | N/A | N/A | N/A | N/A | N/A

Table 5: Comparing some of the more distinct algorithms of Sec.3.1 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*,**,***,****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window/exhaustive approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to which function and context influence the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.

This lack of interest in active vision systems is partially attributable to the fact that power efficiency is not a major factor in the design of vision algorithms. This is also evidenced by the evaluation criteria of vision algorithms in popular conferences and journals, where usually no power metrics are presented. Note that an algorithm’s asymptotic space and time complexity is not necessarily a sufficiently accurate predictor of power efficiency, since it does not necessarily model well the degree of communication between CPU and memory in a von Neumann architecture. One of the main research interests of the object recognition community over the last 10-15 years has been the interpretation of large datasets containing images and video. This has been mainly motivated by the growth of the internet, online video, and smartphones, which make it extremely easy for anyone to capture high quality pictures and video. As a result, most of the vision community’s resources have been focused on addressing the industry’s need for good vision algorithms to mine all this data, and research on active approaches to vision was not a priority.

Recently, however, there has been a significant upsurge of interest in active vision related research. This is evidenced by some of the more recent publications on active vision, which are also discussed in Secs.3.1-3.2. In this section we focus on the active object detection problem, which involves the use of intelligent data acquisition strategies in order to robustly choose the correct value of at least one binary label/classification associated with a small 3D region. The main distinguishing characteristic of the active object detection literature, as compared to the literature on active object localization and recognition, is that in the detection problem we are interested in improving the classification performance in some small 3D region, and are not as interested in searching a large 3D region to determine the positions of one or more objects. In Charts 7, 8 and Tables 5, 6 we compare, along certain dimensions, a number of the papers surveyed in Secs.3.1-3.2. Notice that the active object detection systems of Table 5 make little use of function and context. In contrast to the non-active approaches, all the active vision systems rely on 3D depth extraction mechanisms through passive (stereo) or active sensors.


From Tables 5, 6 we notice that no active recognition system is capable of achieving consistently good performance along all the compared dimensions. In this respect it is evident that the state of the art in passive recognition (Table 7) surpasses the capabilities of active recognition systems.

Wilkes and Tsotsos [266] published one of the first papers to discuss the concept of active object detection (see Fig. 25), by presenting an algorithm to actively determine whether a particular object is present on a table. As the authors argue, single view object recognition has many problems because of various ambiguities that might arise in the image, and because of the inability of standard object recognition algorithms to move the camera, obtain a more suitable viewpoint of the object and, thus, escape from these ambiguities. The paper describes a behaviour based approach to camera motion and describes some solutions to the above mentioned ambiguities. These ambiguities are discussed in more detail by Dickinson et al. [280]. Inspired by the arguments in [266], the authors begin by presenting certain reasons as to why the problem of recognizing objects from single images is so difficult. The reasons were discussed in the previous section and include the impossibility of inverting projection, occlusions, feature detectability issues, the fragility of 3D inference, and view degeneracies. To address these issues the authors define a special view as a view of the object optimizing some function f of the features extracted from the image data. Let P_0, P_1, P_2 be three points on the object and d_ij denote the length of the projected line segment between points P_i and P_j. The authors try to locate a view of the object maximizing d_01 and d_02, subject to the constraint that the distance of the camera from the center of the line joining P_0 and P_1 is some constant r. The authors argue that such a view will make it less likely that they will end up in degeneracies involving points P_0, P_1, P_2 [49]. Once they have found this special view, the authors suggest using any standard 2D pattern recognition algorithm to do the recognition. Within the context of the standard recognition pipeline in Fig.1, we see that [266] showed how an active vision system can escape from view degeneracies, thus leading to more reliable feature extraction and grouping.

Callari and Ferrie [300, 279] introduce a method for view selection that uses prior information about the objects in the scene. The work is an example of an active object detection system that incorporates contextual knowledge. This contextual knowledge is used to select viewpoints that are optimal with respect to a criterion. This constrains the gaze control loop, and leads to more reliable object detection. The authors define contextual knowledge as the join of a discrete set of prior hypotheses about the relative likelihood of various model parameters s, given a set of object views, with the likelihood of each object hypothesis as the agent explores the scene. The active camera control mechanism is meant to augment this contextual knowledge and, thus, enable a reduction in the amount of data needed to form hypotheses and provide us with more reliable object recognition. The paper describes three main operations that an agent must perform: (a) Data collection, registration with previous data and modelling using a pre-defined scene class. (b) Classification of the scene models using a set of object hypotheses. (c) Disambiguation of ambiguous hypotheses/classifications by collecting new object views/data to reduce ambiguity. The paper does not discuss how to search through an arbitrary 3D region to discover the objects of interest. The paper assumes that the sensor is focused on some object, and any motion along the allowed degrees of freedom will simply sense the object from a different viewpoint (i.e., it tackles a constrained version of the object search problem). Thus, this active vision system provided a methodology for improving the object hypothesis and verification phases of the pipeline in Fig. 1.

Dickinson et al. [280] combine various computer vision techniques in a single framework in order to achieve robust object recognition. The algorithm is given the target object as its input. Notice that even though the paper does deal with the problem of object search and localization within a single image, its next viewpoint controller deals mostly with verifying the object identity from a new viewpoint, which is the reason we refer to this algorithm as an active object detector.

The paper combines a Bayesian based attention mechanism with aspect graph based object recognition and viewpoint control, in order to achieve robust recognition in the presence of ambiguous views of the object. See Figs. 26, 27 for an overview of the various modules implemented in the system. The object representation scheme is a combination of Object Centered Modelling and Viewer Centered Modelling. The object centered modelling is accomplished by using 10 geons. These geons can be combined to describe more complex types of objects. The viewer centered modelling is accomplished by using aspects to represent a small set of volumetric parts from which an object is constructed, rather than directly representing an object. One obvious advantage of this is the decrease in the size of the aspect hierarchy. However, if a volumetric part is occluded, this could cause problems in the recognition. To solve this problem, the authors extend the aspect graph representation into an aspect graph hierarchy (see Fig. 12) which consists of three levels: the set of aspects that model the chosen volumes, the set of component faces of the aspects, and the set of boundary groups representing all subsets of contours bounding the faces. The idea is that if an aspect is occluded, they can use some of these more low-level features to achieve the recognition. From this hierarchy of


[Figure 26 diagram: the face recovery stage applies edge preserving adaptive smoothing, a morphological gradient, thresholding with hysteresis and connected component extraction to the image, characterizes region shapes and labels regions according to the aspect hierarchy, producing region, region boundary and face topology graphs; the attention mechanism then matches the target object's faces (drawn from the object database and the expected scene contents) to the image faces, selecting the most likely volume, aspect and face and producing ranked face search positions.]

Figure 26: The face recovery and attention mechanism used in [280] (diagram adapted from [280]).


[Figure 27 diagram: the object verification stage matches image faces, aspects and primitives against the target face, aspect and volume, verifies the target aspect, instantiates the target volume and assembles the target object, with attention control acquiring other object volumes if necessary; the camera motion stage takes the recovered volume/aspect and the aspect prediction graph, checks whether the volume is ambiguous, selects the least ambiguous aspect of the volume, selects the transition in the APG from the current aspect to the target aspect, and selects the direction of camera motion in the image plane or on the surface of the viewing sphere.]

Figure 27: The object verification and next viewpoint selection algorithm used in [280] (diagram adapted from [280]).


geon primitives, aspects, faces and boundary groups, the authors create a Bayesian network, and extract the associated conditional probabilities. The probabilities are extracted in a straightforward manner by uniformly sampling the geons using a Gaussian sphere. For example, to estimate the probability of face x occurring given that boundary group y is currently visible, they use the sampled data to calculate the related probability. From this data, the authors use a slight modification of Shannon's entropy formula to discover that for the geon based representation, faces are more discriminative than boundary groups. Therefore, they use faces as a focus feature for the recovery of volumetric parts.

Using various segmentation algorithms described in the literature, the authors segment the images and create region topology graphs (denoting region adjacencies), region boundary topology graphs (denoting relations between partitioned segments of bounding contours) and face topology graphs (indicating the labelled face hypothesis for all regions in the image). Each region's shape in the image is classified by matching its region boundary graph to those graphs representing the faces in the augmented aspect hierarchy graph using interpretation tree search. This enables the creation of face topology graphs labelling the current image. They use this face topology graph labelling with attention driven recognition in order to limit search in both the image and the model database. Given as input the object they wish to detect, the authors define a utility function U that can be used in conjunction with the previously defined conditional probabilities and the aspect graph, to determine the most likely face with which to start their search, given the object they are trying to find. The search uses concepts inspired from game theory, and proceeds until there is a good match between the face topology graph for the image and the augmented aspect graph.

Then, a verification step is performed, using various metrics to see if the aspects and volumes also match. If there is no match the authors proceed with the next most likely matching face, and the process continues in this manner. Extensions of this recognition algorithm to multi-part objects are also described and involve some extra steps in the verification phase, searching for connectedness among the part aspects. The final component of the recognition algorithm involves viewpoint control. Viewpoint control makes it possible to resolve viewpoint degeneracies. As already discussed in this survey (also see the discussion towards the end of this section), such degeneracies have been shown to frequently occur in practice. The authors define an aspect prediction graph, which is a more compact version of the aspect graph and specifies transitions between topologically equivalent views of the object. They use this graph to decide the direction of camera motion. The main idea is to move the camera to the most likely aspect — excluding the already viewed aspects — based on the previously calculated conditional probabilities and the most likely volume currently viewed, in order to verify whether it is indeed this hypothesized volume that is in the field of view. Then the algorithm described above is repeated.

The main innovation of the paper is the combination of many ideas in computer vision (attention, object recognition, viewpoint control) in a single framework. Limitations of the paper include the assumption that objects can be represented as constructions of volumetric parts — which is difficult for various objects such as clouds or trees — and its reliance on salient homogeneous regions in the image for the segmentation. Real objects contain a lot of detail, and the segmentation is in general difficult. Notice that there is room for improvement in the attentive mechanisms used. No significant effort is made to create a model that adaptively adjusts its expressive power during learning, potentially making proper training of the model somewhat of an art and dependent on manual intervention by the user. As is the case with many of the papers described so far, the model relies heavily on the extraction of edges and corners, which might make it difficult to distinguish an object based on its texture or color. Within the context of Fig. 1, the work by Dickinson et al. [280] proposes an active vision framework for improving all the components of the standard vision pipeline. This also includes the 'object databases' component, since the use of a hierarchy is meant to provide a space-wise efficient representation of the objects.

Schiele and Crowley [281] describe the use of a measure called transinformation for building a robust recognition system. The authors use this to describe a simple and robust algorithm for determining the most discriminating viewpoint of an object. Spectacular recognition rates of nearly 100% are presented in the paper. The main idea of the paper is to represent the 3D objects by using the probability density function of local 2D image characteristics acquired from different viewpoints. The authors use Gaussian derivatives as the local characteristics of the object. These derivatives allow them to build histograms (probability density functions) of the image resulting after the application of the filter. Assuming that a measurement set M of some local characteristics m_k is acquired from the image — where the local characteristics might be, for example, the x-coordinate derivatives or the image's Laplacian — they obtain a probability distribution p(M|o_n, R, T, P, L, N) for the object o_n (where R, T, P, L, N denote the rotation, translation, partial occlusion, light changes and noise). The authors argue that for various reasons (the filters they use, the use of histograms and so on) the distribution is conditionally independent of several of these variables and it


suffices to build histograms for p(M|o_n, S), where S denotes the rotation and one of the translation parameters. The authors define the quantity

p(o_n, S_j | m_k) = \frac{\prod_k p(m_k | o_n, S_j)}{\sum_{n', j'} \prod_k p(m_k | o_{n'}, S_{j'})}    (9)

which gives the probability of object o_n in pose S_j occurring given that we know the resulting images under the set m_k of filters. The probabilities on the right hand side are known through the histogram based probability density estimation we described above. We can use this probability to recognize the object we currently view and its pose by simply maximizing the probability over all values of the variables n, j. Test results that the authors cite indicate that this performs very well even in cases where only 40% of the object is visible. The authors then describe the object recognition process in terms of the transmission of information. The quantity

T(O, M) = \sum_{n,k=1,1}^{N,K} p(o_n, m_k) \log \frac{p(o_n, m_k)}{p(o_n) p(m_k)}    (10)

(for the sets O, M of the objects and image features respectively) is the transinformation. Intuitively, the lower the quantity, the “closer” the two sets are to being statistically independent, implying that one set's values do not affect the other set's values. This is used to choose the salient viewpoints of an object and thus provide an algorithm for active object detection. By rewriting the previous equation for transinformation as

T(O, M) = \sum_{n=1}^{N} p(o_n) \sum_{k=1}^{K} p(m_k | o_n) \log \frac{p(m_k | o_n)}{p(m_k)}    (11)

we see that the transinformation can be interpreted as the average of each object o_n's transinformation T(o_n, M) = \sum_{k=1}^{K} p(m_k | o_n) \log \frac{p(m_k | o_n)}{p(m_k)}. By going one step further and incorporating the pose S_j of an object in the previous definition of transinformation we get

T(o_n, S_j, M) = \sum_{k=1}^{K} p(m_k | o_n, S_j) \log \frac{p(m_k | o_n, S_j)}{p(m_k)}    (12)

and we see that we can find the most significant viewpoints of an object by finding the maximum over all j of this equation. The authors use this last formula to hypothesize the object identity and pose from an image. Then, they use this last formula again to estimate the most discriminating viewpoint for the hypothesized object, move the camera to that viewpoint, perform verification, and proceed until some threshold is passed, indicating that the object has been identified.
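
A compact sketch of this viewpoint-selection step, assuming the per-view measurement histograms p(m_k|o_n, S_j) have already been estimated; the array shapes and the toy data below are purely illustrative:

```python
import numpy as np

def transinformation_per_view(p_m_given_view, p_m):
    """T(o_n, S_j, M) = sum_k p(m_k|o_n,S_j) log( p(m_k|o_n,S_j) / p(m_k) ), as in
    Eq. (12); p_m_given_view is a (num_views, K) array of histogram estimates for
    one object, p_m is the (K,) marginal over measurements."""
    eps = 1e-12                                  # guard against empty histogram bins
    ratio = (p_m_given_view + eps) / (p_m[None, :] + eps)
    return np.sum(p_m_given_view * np.log(ratio), axis=1)

def most_discriminating_view(p_m_given_view, p_m):
    """Return the pose index S_j maximizing the per-view transinformation, i.e.
    the viewpoint to move the camera to for verification."""
    return int(np.argmax(transinformation_per_view(p_m_given_view, p_m)))

# Hypothetical toy example: 4 views, 8 histogram bins.
rng = np.random.default_rng(1)
p_mv = rng.dirichlet(np.ones(8), size=4)         # p(m_k | o_n, S_j)
p_m = p_mv.mean(axis=0)                          # crude marginal p(m_k)
print("next view:", most_discriminating_view(p_mv, p_m))
```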

Overall, the main advantage of the paper is that it provides an elegant and simple method to perform object recognition. The test results provide strong evidence of the power of the active object recognition framework. The more verification steps performed, the lower the misclassification rate. A drawback of the method is that it has not been tested on much larger datasets, and little work has been done to see how it performs under non-uniform backgrounds. Furthermore, a question arises about the algorithm's performance as the errors in the estimation of the camera position increase. As discussed in [45], the implications could be significant.

Similarly to the above paper, Borotschnig et al. [282] use an information theoretic quantity (entropy) in order to decide the next view of the object that the camera should take, so as to recognize the object and obtain more robust object recognition in the presence of ambiguous viewpoints. The approach uses an appearance based recognition system (inspired by Murase and Nayar's [141] popular PCA based recognition algorithm) that is augmented by probability distributions. The paper begins by describing how to obtain an eigenbasis of all the objects in the database from all views. Then, given a new image, the algorithm can project that new image on the eigenbasis to obtain a point g in that basis, denoting the image. Denote by p(g|o_i, ϕ_j) the probability of point g occurring in the eigenspace of all objects when projecting an image of object o_i with pose parameters ϕ_j. Under ideal circumstances p(g|o_i, ϕ_j) would be a spike function. In other words, the function would be zero for all values of g, except for one value for which it would be equal to 1. However, due to various sources of error (fluctuations in imaging conditions, pan, tilt, zoom errors, segmentation errors, etc.) the authors estimate this probability from a set of sample images with fixed o_i and ϕ_j values.


Figure 28: Graphical model for next-view-planning as proposed in [284, 285].

The probability density function is modelled as a multivariate Gaussian with mean and standard deviation estimated from the sample images. By Bayes' theorem it can be shown that

P(o_i, ϕ_j | g) = \frac{p(g | o_i, ϕ_j) p(ϕ_j | o_i) p(o_i)}{p(g)}.    (13)

In the experiments the authors assumed that p(o_i) and p(ϕ_j|o_i) are uniformly distributed. In their test cases the authors choose a number of bins in which to discretize the possible viewpoints and use them to build these probability distribution functions. Then, given some vector g in the eigenspace of shapes, the conditional probability of seeing object o_i is given by p(o_i|g) = \sum_j P(o_i, ϕ_j|g). By iterating over all the objects in the database and finding the most likely object, objects are recognized. The authors then further expand on this idea and present an algorithm for actively controlling the camera. They show that in cases where the object database contains objects that share similar views, the active object recognition framework leads to striking improvements. The key to this is the use of planned camera movements that lead to viewpoints from which the object appears distinct. Note also that the authors use only one degree of freedom for rotating around the object along a constant radius. However, extensions to arbitrary rotations should be straightforward to implement. The authors define a metric s(∆ψ) which gives the average entropy reduction of the object identity if the point of view is changed by ∆ψ. Since there is a discrete number of views, finding the optimal ∆ψ is a simple linear search problem. The authors draw three major conclusions based on their results: (a) the dimension of the eigenspace can be lowered significantly if active recognition is guiding the object classification; in other words, active recognition might open the way to the use of very large object databases, suitable for real world applications; (b) even objects that share most views can be successfully disambiguated; (c) the number of steps needed to obtain good recognition results is much lower than with random camera placement (2.6 vs. 12.8 steps on average), again indicating the usefulness of the algorithm. This last point is further supported in [24]. The three above points demonstrate how an active vision framework might decrease the size of the object database needed to represent an object, and help improve the object hypotheses and verification phase, by improving the disambiguation of objects that share many views (see Fig. 1).

These ideas were further expanded upon by Paletta and Prantl [283], where the authors incorporated temporal context as a means of helping disambiguate initial object hypotheses. Notice that in their previous work, the authors treated all the views as “bags of features” without taking advantage of the view/temporal context. In [283] the authors address this shortcoming by adding a few constraints to their probabilistic quantities. They add temporal context to their probabilistic formulation by encoding that the probability of observing a view (o_i, ϕ_j) due to a viewpoint change ∆ϕ_1 must be equal to the probability of observing view (o_i, ϕ_j − ∆ϕ_1). This leads to a slight change in the Bayesian equations used to fuse the data and leads to an improvement in recognition performance. In [301] the authors use a radial-basis function based network to learn object identity. The authors point out that the on-line evaluation of the information gain — and, in fact, of most probabilistic quantities — is intractable, and that learned mappings of decision policies therefore have to be applied in next view planning to achieve real-time performance.


Roy et al. [284] present an algorithm for pose estimation and next-view planning. A novelty of this paper is that it presents an active object recognition algorithm for objects that might not fit entirely in the camera's field of view and does not assume calibrated intrinsic parameters. In other words it improves the feature grouping and object hypothesis modules of the standard recognition pipeline (see Fig. 1), through the use of a number of invariants that enable the recognition of objects which do not fit in a camera's field of view, and thus are not recognizable using a passive approach to vision. It should be pointed out that this was the first active recognition/detection system to tackle this important and often encountered real world problem. The paper introduces the use of inner camera invariants for pose estimation. These image computable quantities, in general, do not depend on most intrinsic camera parameters, but assume a zero skew. The authors use a probabilistic reasoning framework that is expressed in terms of a graphical model, and use this framework for next-view planning to further help them with disambiguating the object. Andreopoulos and Tsotsos [239] also present an active object localization algorithm that can localize objects that might not fall entirely in the sensor's field of view (see Fig. 23). Overall this system was shown to be robust in the case of occlusion/clutter. A drawback of the method is that it was only tested with simple objects that contained parallelograms. It would be interesting to see how the method would extend if we were processing objects containing more complicated features. Again, its sensitivity to dead-reckoning errors is not investigated.

Roy and Kulkarni [285] present a related paper with a few important differences. First of all, the paper does not make use of invariant features as [284] does. Furthermore, the graphical model is used to describe an appearance based aspect graph: features ρ_{ij} represent the aspects of the various objects in the database, and the classes C_k represent the sets of topologically equivalent aspects. These aspects might belong to different parts of the same object, or to different objects altogether, yet they are identical with respect to the features we measure. For each class C_k the authors build an eigenspace U_k of object appearances. Given any image I, they find the eigenspace parameter c and affine transformation parameter a that minimize

ρ(I(x + f(x, a)) − [U_k c](x), σ)    (14)

where ρ is a robust error function, σ is a scale parameter and f is an affine transformation. They use this c to find the most likely class C_k corresponding to the object. The probabilities are estimated from the reconstruction error induced by projecting the image I on each one of the class eigenspaces U_k. The smaller the reconstruction error, the more likely we have found the corresponding class. Then, the a priori estimated probabilities P(ρ_{ij}|C_k) are used to find the most likely object O_m corresponding to the viewed image. If the probability of the most likely object is not high enough, we need to move to a next view to disambiguate the currently viewed object. The view-planning is similar to that of [284], except that there is just one degree of freedom in this paper (clockwise or counterclockwise rotation around some axis). By using a heuristic that is very similar to the one in [284], based on knowledge from previously viewed images of the object, the authors form a list of the camera movements that should be made to disambiguate the object. This procedure is repeated until the object is disambiguated.
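
The class-selection step can be sketched as follows, with the affine alignment of Eq. (14) omitted and an exponential mapping from reconstruction error to likelihood used as an assumed stand-in for the robust error function ρ:

```python
import numpy as np

def class_likelihoods(image_vec, class_bases, beta=1.0):
    """For each congruence class C_k with orthonormal eigenbasis U_k (columns),
    score the input by its reconstruction error and convert errors into
    probability-like weights (assumed exponential mapping)."""
    errors = []
    for U in class_bases:                        # U has shape (D, d_k)
        c = U.T @ image_vec                      # least-squares eigenspace coefficients
        recon = U @ c
        errors.append(np.linalg.norm(image_vec - recon))
    errors = np.array(errors)
    w = np.exp(-beta * errors)
    return w / w.sum(), errors

# Hypothetical usage: 3 classes with 10-dimensional eigenspaces over 100-D images.
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.normal(size=(100, 10)))[0] for _ in range(3)]
img = rng.normal(size=100)
probs, errs = class_likelihoods(img, bases)
print("most likely class:", int(np.argmax(probs)))
```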

The authors use the COIL-20 object database from Columbia University for their testing. The single-view based correct recognition rate was 65.70%, while the multi-view recognition rate increased to 98.19%, indicating the usefulness of the recognition results and the general promise of the active object recognition framework under good dead-reckoning. Furthermore, the average number of camera movements needed to achieve recognition was 3.46, vs. 5.40 moves for the case of random camera movements, again indicating the usefulness of the heuristic the authors defined for deciding the next view. Notice that this is consistent with the results in [24, 282]. Disadvantages of the paper include the testing of the method on objects with only a black background (of apparently little occlusion) and the use of only a single degree of freedom in moving the camera to disambiguate the object.

Hutchinson and Kak [286] present one of the earliest attempts at active object detection. The authors generalize their work by assuming that they can have at their disposal many different sensors (monocular cameras, laser range-finders, manipulator fingers, etc.). Thus, within the context of the standard object recognition pipeline (Fig. 1), this is an example of a system that combines multiple types of feature extractors. It also represents one of the earliest active approaches for object hypothesis and verification. Each one of those sensors provides various surface features that can be used to disambiguate the object. These features include surface normal vectors, Gaussian and mean curvatures, and the area and orientation of each polyhedral surface, amongst others. By creating an aspect graph for each object and by associating with each aspect the features corresponding to the surfaces represented by that aspect, the algorithm can formulate hypotheses as to the objects in a database that might correspond to the observed object. The authors then


[Figure 29 diagram: eight aspects of the object (aspects 0-7) grouped into four congruence classes (aspects 0 and 7 in class 0, aspects 1 and 6 in class 1, aspects 2 and 5 in class 2, aspects 3 and 4 in class 3).]

Figure 29: The aspects of an object and its congruence classes (adapted from Gremban and Ikeuchi [287]).

perform a brute-force search over all the aspects of each aspect graph in the hypotheses, and move the camera to a viewpoint of the object that will lead to the greatest reduction in the number of hypotheses. In general, this is one of the first papers to address the active object detection problem. A disadvantage of this paper is the oversimplifying assumption of polyhedral objects. Another disadvantage is the heuristic used to make the camera movements, since in general it gives no guarantees that the sensor movement will be optimal in terms of the number of movements until recognition takes place. Notice that complexity issues need to be addressed, since in practice the aspect graphs of objects are quite large and can make brute-force search through the aspects of all aspect graphs infeasible. Furthermore, as is the case with most of the active object recognition algorithms described so far, the issue of finding the optimal sequence of actions subject to a time constraint is not addressed.
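
A minimal sketch of the underlying next-view criterion (greatest expected reduction of the hypothesis set), assuming a hypothetical predict_features(hypothesis, view) lookup that plays the role of the aspect-graph features; the uniform weighting of the hypotheses is our simplification:

```python
def expected_remaining_hypotheses(hypotheses, candidate_view, predict_features):
    """For a candidate sensor placement, count how many hypotheses would on average
    survive, assuming each current hypothesis is equally likely and that
    predict_features(h, view) returns the discrete feature set hypothesis h would
    produce from that view."""
    total = 0
    for assumed_true in hypotheses:
        observed = predict_features(assumed_true, candidate_view)
        survivors = [h for h in hypotheses
                     if predict_features(h, candidate_view) == observed]
        total += len(survivors)
    return total / len(hypotheses)

def best_next_view(hypotheses, candidate_views, predict_features):
    """Brute-force search, in the spirit of [286], for the view that maximally
    prunes the hypothesis set."""
    return min(candidate_views,
               key=lambda v: expected_remaining_hypotheses(hypotheses, v,
                                                           predict_features))
```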

Gremban and Ikeuchi [287] investigate the sensor planning phase of object recognition, and thus their work constitutes another effort in improving the object hypothesis and verification stages of the standard recognition pipeline (see Fig. 1). Like many of the papers described in this survey, the algorithm uses aspect graphs to determine the next sensor movement. Similarly to [285] and [284], the authors of this paper make use of so-called congruent aspects. In a computer vision system, aspects can be defined in various ways. The most typical way of defining them is based on the set of visible surfaces or the presence/absence of various features. Adjacent viewpoints over a contiguous object region, for which the features defining the aspect remain the same, give an aspect equivalence class. In practice, however, researchers who work with aspect graphs have noticed that the measured features can be identical over many disparate viewpoints of the object. This makes it impossible to determine the exact aspect viewed. These indistinguishable aspects which share the same features are called congruence classes. The authors argue that any given feature set will consist of congruent aspects, and that this is responsible for the fact that virtually every object recognition system uses a unique feature set — in order to improve the performance of the algorithm on that particular domain and distinguish between the congruent aspects. Other reasons why congruent aspects might arise include noise and occlusion. The authors argue that since congruent aspects cannot be avoided, sensing strategies are needed to discriminate them. In Fig. 29 we give an example of the aspects of an object and its congruence classes, where the feature used to define the aspects is the topology of the viewed surfaces in terms of the visible edges. The authors use Ikeuchi and Kanade's aspect classification algorithm [41] to find the congruence class corresponding to the aspect viewed by the camera. Camera motion is then used to decide which aspect this particular class corresponds to. This is referred to as aspect resolution. This enables the system to recognize whether the image currently viewed contains the target object. The authors define a class restricted observation function Ω(ψ, θ) that returns the congruence class currently viewed by the camera. The variable ψ defines the angle of rotation of the sensor around some axis in the object's coordinate system — the authors assume initially that the only permissible motion is rotation around one axis — and θ denotes the rotation of the object with respect to the world coordinate frame. An observation function Ω(ψ, θ) can be constructed for the object model that is to be identified in the image. The authors discuss in the paper only how to detect instances of a single object, not how to perform image understanding. The authors initially position the camera at ψ = 0 — they


Figure 30: An aspect resolution tree used to determine if there is a single interval of values for θ that satisfy certain constraints (adapted from Gremban and Ikeuchi [287]).

assume that the object they wish to recognize is manually positioned in front of the camera with an appropriate pose — and estimate the congruence class γ that is currently viewed by investigating the extracted features (see Fig. 30). By scanning through the function Ω(ψ, θ) they find the set of values of θ (if any) for which Ω(0, θ) = γ. If no values of θ satisfy this function, the object viewed is not the one they are searching for. Otherwise, by using a heuristic, the authors move the camera to a new value of ψ, estimate the congruence class currently viewed by the camera and use this new knowledge to further constrain the values of θ satisfying this new constraint (see Fig. 30). If they end up with a single interval of values of θ that satisfies all these constraints, they have recognized an instance of the object they are looking for. The authors can also use this knowledge to extrapolate the aspect that the sensor is currently viewing, and thus achieve aspect resolution. The authors describe various data structures for extending this idea to more than a single degree of camera motion.
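
The θ-constraining loop can be sketched as follows, where omega(psi, theta) stands for the class-restricted observation function Ω(ψ, θ) over an ordered discretization of rotations; the contiguity test is a crude stand-in for the single-interval check of [287]:

```python
def consistent_thetas(omega, observations, theta_values):
    """Keep only the object rotations theta consistent with every congruence class
    observed so far; observations is a list of (psi, gamma) pairs and theta_values
    is the ordered list of discretized rotations."""
    thetas = set(theta_values)
    for psi, gamma in observations:
        thetas &= {t for t in theta_values if omega(psi, t) == gamma}
    return thetas

def recognized(thetas, theta_values):
    """Accept when the surviving thetas form one contiguous run of the discretized
    circle (allowing a wrap-around across the first/last bins)."""
    if not thetas:
        return False
    idx = sorted(theta_values.index(t) for t in thetas)
    gaps = sum(1 for a, b in zip(idx, idx[1:]) if b - a > 1)
    wraps = idx[0] == 0 and idx[-1] == len(theta_values) - 1
    return gaps == 0 or (gaps == 1 and wraps)
```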

Dickinson et al. [49] quantify the observation that degenerate views occupy a significant fraction of the viewing sphere surrounding an object, and show how active and purposive control of the sensor could enable such a system to escape from these degeneracies, thus leading to more reliable recognition. A view of an object is considered degenerate if at least one of the two conditions below holds (see Fig. 31):

1. a zero-dimensional (point-like) object feature is collinear with both the front nodal point of the lens
2. and either:
   (a) another zero-dimensional object feature, or
   (b) some point on a line (finite or infinite) defined by two zero-dimensional object features.

The paper gives various examples of when such degeneracies might occur. An example of a degeneracy is when we have two cubes such that the vertex x of one cube is touching a point y on an edge of the other cube. If the front nodal point of the lens lies on the line defined by points x and y, the authors say that this view of the object is degenerate. Of course, in the case of infinite camera resolution, the chances of this happening are virtually non-existent. However, cameras have finite resolution. Therefore, the chances of degeneracies occurring are no longer negligible.

The authors conduct various experiments under realistic assumptions and observe that for a typical computer vision setup the chances of degenerate views are not negligible and can be as high as 50%. They also tested a parameterization which partially matched the human foveal acuity of 20 seconds of arc, and noticed that the probability


Figure 31: The two types of view degeneracies proposed by Dickinson et al. [49].

of degeneracies is extremely small. The authors argue that this is one reason why the importance of degenerate views in computer vision has been traditionally underestimated. Obviously an active vision system could be of immense help in disambiguating these degeneracies. The authors argue that if the goal is to avoid the degenerate views in a viewer-centered object representation, or to avoid making inferences from such viewpoints, the vision system must have a mechanism for detecting degeneracies and actively controlling the sensor to move it out of the degeneracy. One solution to the problem of reducing the probability of degeneracy — or reducing the chance of having to move the camera — is to simply change the focal length of the camera to increase the resolution in the region of interest. The analysis performed in the paper indicates that it is important to compensate for degeneracies in computer vision systems and also further motivates the benefits of an active approach to vision. Intelligent solutions to the view-degeneracy problem can decrease the probability of executing expensive and unnecessary camera movements to recognize an object. Within the context of the recognition pipeline in Fig. 1, we see that these degeneracies could potentially affect all the modules in the pipeline, from the quality of the low-level features extracted, to the way the features are grouped, and to the reliability of the final object verification.

Herbin [288] presents an active recognition system whose actions can influence the external environment (camera position) or the internal recognition system. The author assumes the processing of segmented images, and uses the silhouette of the objects — chess pieces — to recognize the object. The objects are encoded in aspect graphs, where each aspect contains the views with identical singularities of the object's contour. Each view is encoded by a constant vector indicating whether a convex point, a concave point or no extremum was found. Three types of actions are defined: a camera movement of 5 degrees upwards or downwards, and a switch between two different feature detection scales. The author defines a training phase for associating an action a_t at time t with the sequence of states observed up until time t. This simply learns the permissible actions for a certain object. Standard Bayesian methods determine whether there is high enough confidence so far in the object identity, or whether more aspects should be learned.

Kovacic et al. [289] present a method for planning view sequences to recognize objects. Given a set of objects and object views, where the silhouette of each object view is characterized by a vector of moment-based features, the feature vectors are clustered. Given a detected silhouette, the corresponding cluster is determined. For each candidate new viewpoint, the object vectors in the cluster are mapped onto another feature set of the same objects but from the new viewpoint. A number of different mappings are attempted — where each mapping depends on the next potential view — and each mapping's points are clustered. The next view which results in the greatest number of clusters is chosen, since this will on average lead to the quickest disambiguation of the object class. This procedure is repeated until clusters with only one feature vector remain, at which point recognition is possible.
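
A rough sketch of the cluster-counting criterion, assuming scipy's kmeans2 as the clustering step and an ad hoc rule for deciding how many clusters the mapped feature vectors support (neither choice is specified in [289]):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def count_separable_clusters(features, max_k=10):
    """Report the largest k for which k-means leaves no cluster empty; this rule is
    only an illustration of 'number of clusters', not the procedure of [289]."""
    best = 1
    for k in range(2, min(max_k, len(features)) + 1):
        _, labels = kmeans2(features, k, minit='++')
        if len(np.unique(labels)) == k:
            best = k
    return best

def next_view(candidate_mappings):
    """candidate_mappings: dict mapping each candidate viewpoint to the feature
    vectors the currently ambiguous objects would produce from that view; pick the
    view producing the most clusters (fastest expected disambiguation)."""
    return max(candidate_mappings,
               key=lambda v: count_separable_clusters(candidate_mappings[v]))
```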

Denzler and Brown [290] use a modification of mutual information to determine optimal actions. They determine the action a_l that leads to the greatest conditional mutual information between the object identity Ω and the observed


Chart 8: Summary of the 1992-2012 papers on active object localization and recognition from Table 6. As expected, search efficiency and the role of 3D information are significantly more prominent in these papers (as compared to Chart 7).

Reconstructionist Vision                         | Selective Perception
use all vision modules                           | use only some vision modules
process entire image                             | process areas of the image
maximal detail                                   | sufficient detail
extract representation first                     | ask question first
answer question from representation data         | answer question from scene data
emphasizes research on isolated vision modules   | emphasizes research on systems
use knowledge late in process                    | use knowledge earlier in process
static sensor                                    | active control of sensor
image understanding (reconstruction)             | solve (visual) task
unlimited resources                              | resource limitations
bottom-up control                                | top-down control with opportunism

Figure 32: Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302]

feature vector c. Laporte and Arbel [291] build upon this work and choose the best next viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) between the likelihoods of the observed data under the assumption that the data resulted from two views of two distinct objects. By weighing each Jeffrey divergence by the product of the probabilities of observing the two competing objects and their two views, they can determine the next view that best disambiguates the competing object identity hypotheses, thus again demonstrating the active vision system's direct applicability in the standard recognition pipeline (see Fig. 1).
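
A small sketch of this next-view scoring, assuming a hypothetical predicted_likelihood(hypothesis, action) function that returns a discretized observation likelihood for a candidate sensor action; the pairwise weighting below only approximates the spirit of [291]:

```python
import numpy as np

def jeffrey_divergence(p, q, eps=1e-12):
    """Symmetric KL divergence between two discretized likelihoods."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def next_view_score(action, hypotheses, predicted_likelihood):
    """Score a candidate action by the pairwise Jeffrey divergences of the
    observation likelihoods predicted under competing (object, pose) hypotheses,
    each pair weighted by the product of the current hypothesis probabilities."""
    score = 0.0
    for a, (ha, wa) in enumerate(hypotheses):     # hypotheses: list of (hyp, prob)
        for hb, wb in hypotheses[a + 1:]:
            score += wa * wb * jeffrey_divergence(
                predicted_likelihood(ha, action),
                predicted_likelihood(hb, action))
    return score

def choose_action(actions, hypotheses, predicted_likelihood):
    return max(actions, key=lambda a: next_view_score(a, hypotheses,
                                                      predicted_likelihood))
```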

Mishra and Aloimonos [292] and Mishra et al. [293] suggest that recognition algorithms should always include an active segmentation module. By combining monocular cues with motion or stereo, they identify the boundary edges in the scene. This supports the algorithm's ability to trace the depth boundaries around the fixation point, which in turn can be of help in challenging recognition problems. These two papers provide an example of a different approach to recognition, where the intrinsic recognition module parameters are intelligently controlled and are more tightly coupled to changes in the low-level feature cues and their grouping in the standard recognition pipeline (see Fig. 1).

Finally, Zhou et al. [294] present an interesting paper on feature selectivity. Even though the authors present the paper as having an application to active recognition, and cite the relevant literature, they limit their paper to the medical domain (ultrasound) by selecting the most likely feature(s) that would lead to accurate diagnosis. The authors present three slight modifications to information gain and demonstrate how to choose the feature y that would lead to maximally reducing the uncertainty in classification, given that a set of features X is used. They perform tests to determine the strengths and weaknesses of each approach and recommend a hybrid approach based on the presented


Papers (1992-2012)            | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Rimey and Brown [302]         | *   | **  | *   | *** | *   | **** | *    | *
Wixson and Ballard [303]      | *   | *** | *   | *   | *   | **** | *    | *
Sjoo et al. [304]             | **  | *** | *   | **  | **  | ***  | **** | *
Brunnstrom et al. [305, 306]  | *   | **  | *   | *   | **  | **   | **   | *
Ye and Tsotsos [307]          | *   | *** | *   | *   | *   | **   | **** | *
Minut and Mahadevan [308]     | *   | *** | *   | *   | *   | ***  | *    | *
Kawanishi et al. [309]        | *   | **  | *   | *   | *   | *    | **   | *
Ekvall et al. [310]           | **  | **  | *** | **  | **  | **   | ***  | **
Meger et al. [311]            | **  | **  | *** | **  | **  | *    | ***  | *
Forssen et al. [312]          | **  | **  | *** | **  | **  | *    | ***  | *
Saidi et al. [313]            | *   | *** | *   | *   | *   | **   | ***  | *
Masuzawa and Miura [314]      | *   | *** | **  | **  | **  | *    | **** | *
Sjoo et al. [315]             | *   | **  | *   | *   | *   | *    | ***  | *
Ma et al. [316]               | **  | **  | **  | **  | **  | *    | ***  | *
Andreopoulos et al. [24]      | **  | *** | *** | *** | *** | ***  | **** | *

Table 6: Comparing some of the more distinct algorithms of Sec. 3.2 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*, **, ***, ****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: the focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: the use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: the level of automation in the training process, and the speed with which the training is done. Encoding Scalability: the encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: the distinctiveness and number of indexing primitives used. Uses Function or Context: the degree to which function and context influence the algorithm. Uses 3D: the degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: the degree to which texture discriminating features are used by the algorithm.

metrics as the optimal approach to conditional feature selection. Within the context of an active vision system, feature selection algorithms could be used to choose the optimal next sensor action.

While most of the methods discussed in this section mainly show that active image acquisition makes the problem easier, the last few papers discussed offer insights of a more general nature for object recognition, in which active image acquisition is tightly coupled to the more classical vision and recognition modules. Another general conclusion is that very few of the papers surveyed so far take into consideration the effects of cost constraints, noise constraints (e.g., dead-reckoning errors) or object representational power. As was previously argued [26], taking into account such constraints is important, since they can lead to a reassessment of proper strategies for next-view-planning and recognition.

3.2. Active Object Localization and Recognition Literature Survey

We now present an overview of the literature on the active object localization and recognition problems. In more recent literature, these problems are sometimes referred to under the title of semantic object search. In Table 6 and Chart 8 we compare the algorithms discussed in this subsection along a number of dimensions. A general conclusion one can reach is that, on average, the scalability of inference for active object localization algorithms is worse than the current state of the art in passive recognition (see Table 7 of Sec. 4.2 for example). This is partially attributable to the online requirements of active localization/recognition mechanisms, which make the construction of such real-time and online systems a significant challenge.

Notice that in contrast to the Simultaneous Localization and Mapping (SLAM) problem, in the active object localization problem the vision system is tasked with determining an optimal sequence of sensor movements that enables the system to determine the position of an a priori specified object as quickly as possible. In the SLAM problem, by contrast, the scene features/objects are usually learnt/determined online during the map building process. Notice that within the context of Sec. 1, the localization and recognition problems subsume the detection problem, since the detection problem is a limited/constrained version of the localization and recognition problems.


Figure 33: A PART-OF Bayes net for a table-top scenario, similar to what was proposed by Rimey and Brown [302].

When dealing with the vision-based SLAM problem, the issue of extracting scene structure from a moving platform and using this information to build a map of the environment emerges. While this problem also emerges in the active object localization and recognition problem, in practice it is typically of secondary importance, since the main research effort while constructing active object localization and recognition systems is focused around the creation of the object recognition module and the creation of the next-viewpoint selection algorithm. As was pointed out at the beginning of Sec. 3, active object localization and recognition research on dynamic scenes is limited, and in this regard it is less developed than the structure from motion and SLAM literature.

For example, Ozden et al. [317] indicate that the main requirements for building a robust dynamic structure from motion framework include:

• constantly determining the number of independently moving objects.

• segmenting the moving object tracks.

• computing the object 3D structure and camera motion with sufficient accuracy.

• resolving geometric ambiguities.

• achieving robustness against degeneracies caused by occlusion, self-occlusion and motion blur.

• scaling the system to non-trivial recording times.

It is straightforward to see that these also constitute important requirements when constructing an active object localization and recognition system, since making a recognition system robust to these challenges would likely require changes to all the components of the standard recognition pipeline (see Fig. 1). However, none of the active localization and recognition systems that we will survey is capable of dealing with dynamic scenes, demonstrating that the field is still evolving. Note that this last point differentiates active vision research from dynamic vision research (see Sec. 3).

In the active object localization and recognition problems, any reduction in the total number of mechanical movements involved would have a significant effect on the search time and the commercial viability of the solution. Thus, a central tenet of the discussion in this section involves efficient algorithms for locating objects in an environment subject to various constraints [45, 26]. The constraints include time constraints, noise rates, and object and scene representation lengths, amongst others. In Table 6 and Chart 8 we present a comparison, along certain dimensions, of a number of the papers surveyed in Sec. 3.2.

Rimey and Brown [302] present the TEA-1 vision system that can search within a static image for a particular object and that can also actively control a camera if the object is not within its field of view. Within the context of


Figure 34: An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown [302].

Minsky's frame theory [124], which we discussed in Sec. 2.7, the authors define a knowledge representation framework that uses “PART-OF”, “IS-A” and “adjacent” relationships — a form of contextual knowledge — for guiding the search. The authors [302] also focus on the decision making algorithms that are used to control the current focus of attention during the search for the object. A Bayesian network is used to encode the confidences regarding the various hypotheses. As the authors point out, a significant constraint in any vision system that purposively controls an active sensor, such as a camera, is resource allocation and minimization of the time-consuming camera movements. Purposiveness is necessary in any active vision system. The system must attempt specific tasks. Open-ended tasks such as “randomly move the camera around the entire room until the desired object falls in our field of view” lack the purposiveness constraint. A number of papers [282, 285, 24] have experimentally demonstrated that random search exhibits a significantly worse reliability and localization speed than purposive search, giving further credence to the arguments given in Rimey and Brown [302]. This approach to vision is inspired by the apparent effect that task specification has on human eye movements. As Yarbus demonstrated [318], human foveation patterns depend on the task at hand and the fixated objects seem to be the ones relevant for solving a particular task. Somehow, irrelevant features are ignored and humans do not search through the entire scene. This is exactly what Rimey and Brown are trying to accomplish in their paper, namely, to perform sequential actions that extract the most useful information and perform the task in the shortest period of time. Thus, within the context of the standard recognition pipeline in Fig. 1, this constitutes an effort in improving the object hypothesis generation module. The authors provide a nice summary of the main differences between the selective/attentive approach to vision and the reconstructionist/non-active/non-attentive approach to vision (see Fig. 32).

The authors use two different Bayesian-network-like structures for knowledge representation: composite nets and two-nets. The composite net, as its name suggests, is composed of four kinds of nets: PART-OF nets, IS-A trees, expected area nets and task nets (see Figs. 33, 34). PART-OF nets are graphical models which use PART-OF relations to model the feasible structure of the scene and the associated conditional probabilities (see Fig. 33). Each node is a Boolean variable indicating the presence or absence of a particular item. For example, a node might represent a tabletop, its children might represent different kinds of tables, and each kind of table might have nodes denoting the types of utensils located on the particular table type. Expected area nets have the same structure as PART-OF nets and identify the area in the particular scene where the object is expected to be located and the area it will take up. These are typically represented using 2D discrete random variables representing the probability of the object being located in a certain grid location. Values for the height and width of objects are also typically stored in the expected area net. A


             | Object location                    | False positives in search region w | Resolution required for recognition | Size of region n near object, relative to size of search region w | Cost of recognition
Target       | Most located near an intermediate  | ------                             | High                                | ------                                                             | High
Intermediate | Most have a target instance nearby | Few                                | Low                                 | Small                                                              | Low

Figure 35: The way various conditions affect the search for the target object and for intermediate objects. Dashed entries represent conditions which, according to the model of Wixson and Ballard [303], do not affect the search efficiency. Adapted from [303].

relation-map is also defined, which uses the expected area net to specify the relative location probability of one object given another object. An IS-A tree is a taxonomic hierarchy representing mutually exclusive subset relationships of objects (see Fig. 34).

For example, one path in the hierarchy might be object → table-object → bowl → black-bowl. A task-net specifies what kind of scene information could help with solving a recognition problem, but it does not specify how to obtain that information. The two-net is a simpler version of the composite net, and is useful for experimental analysis. The authors then define a number of actions, such as moving the camera or applying a simple object detection algorithm. By iteratively choosing the most appropriate action to perform, and updating the probabilities based on the evidence provided by the actions, recognition is achieved. Each action has a cost and a profit associated with it. The cost might include the cost of moving a camera, and the profit increases if the next action is consistent with the probability table's likelihoods. Three different methods for updating the probabilities are suggested. The dummy-evidence method sets a user specified node in the composite-nets and two-nets to a constant value, specifying judgemental values about the node's values. The instantiate-evidence method is used when a specific value of a random variable is observed as true. Finally, the IS-A evidence approach uses the values output by an action to update the IS-A net's probabilities using the likelihood ratio for some evidence e: λ = p(e|S)/p(e|¬S), where S denotes whether a specific set of nodes in the IS-A tree was detected or not by the action. The costs and profits are used to define a goodness function which is used to select the best next action. A depth-first search in the space of all action sequences is used to select the best next action that would minimize the cost and lead to the most likely estimation of the unknown object or variable. The authors perform some tests on the problem of classifying whether a particular tabletop scene corresponds to a fancy or non-fancy meal, and present some results on the algorithm's performance as the values of the various costs were adjusted. The method is tested only for recognizing a single 2D scene.
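
Two of these ingredients admit a very small sketch: the likelihood-ratio evidence update and a greedy, one-step version of the cost/profit trade-off (TEA-1 itself searches over action sequences, so this is only an illustration):

```python
def update_belief(prior, likelihood_ratio):
    """Update the belief in S from evidence e with likelihood ratio
    lambda = p(e|S) / p(e|not S), i.e. a posterior-odds update."""
    odds = likelihood_ratio * prior / (1.0 - prior)
    return odds / (1.0 + odds)

def best_action(actions):
    """Greedy one-step selection: each action is a dict with an expected 'profit'
    and a 'cost'; a simplified stand-in for the goodness function of [302]."""
    return max(actions, key=lambda a: a['profit'] - a['cost'])

# Hypothetical usage: evidence twice as likely if the node is present.
print(update_belief(prior=0.3, likelihood_ratio=2.0))   # roughly 0.46
```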

Wixson and Ballard [303] present an active object localization algorithm that uses intermediate objects to maximize the efficiency and accuracy of the recognition system (see Fig. 35 and Fig. 36). The paper was quite influential, and similar ideas are explored in more recent work [304, 319, 320]. The system by Wixson and Ballard [303] incorporates a form of contextual knowledge about the scene by encoding the relation between intermediate objects. Such intermediate objects are usually easy to recognize at low resolutions and are, thus, located quickly. Since we typically have some clues about the target object's location relative to the intermediate object's location, we can use intermediate objects to speed up the search for the target. The authors present a mathematical model of search efficiency that estimates the factors which affect search efficiency, and they use these factors to improve it. They note that in their experiments, indirect search provided an 8-fold increase in efficiency. As the authors indicate, the higher the resolution needed to accurately recognize an object, the smaller the field of view of the camera has to be — because, for example, we might need to bring the camera closer to the object. However, this forces more


[Figure 36 diagram: in the direct-search model, clutter, image resolution, the size of the search region and the number of object instances in the search region determine the explicit model parameters (the probability β that a positive response is a false positive, the number of positive responses P(R) from the recognizer in the search region, the number of views V needed to span the search region, and the cost c of applying the detector), which in turn determine the success probability γ, the expected number of views and the expected cost T.]

Figure 36: The direct-search model, which includes nodes that affect direct search efficiency (unboxed nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and Ballard [303].

mechanical movements of the camera to acquire more views of the scene, which are typically quite time consuming. This indicates a characteristic trade-off in the active localization literature that many researchers in the field have attempted to address, namely, search accuracy vs. total search time.

In this work the authors speed up the search through the use of intermediate objects. An example is the task of searching for a pencil by first locating a desk — since pencils are usually located on desks. Thus, within the context of the standard recognition pipeline in Fig. 1, this constitutes an effort to improve the feature grouping and object hypothesis generation module, by using intermediate objects to influence the grouping probabilities and the relevant hypotheses of various objects or object parts. The authors demonstrate the smaller number of images required to detect the pencil if the intermediate object detected was the desk — an almost two-thirds decrease. The efficiency of a search is defined as γ/T, where γ is the probability that the search finds the object and T is the expected time to do the search. The authors model direct and indirect search. Direct search (see Figs. 35, 36) is a brute-force search defined in terms of the random variable R denoting the number of objects detected by our object detection algorithm over a search sequence spanning the search space, the probability β of detecting a false positive, the number of possible views V for the intermediate object, and c_j, the average cost for each view j. Usually c_j is a constant c for all j. The success probability of direct search is

\gamma_{\mathrm{dir}} = [1 - P(R = 0)](1 - \beta) \qquad (15)

and the expected cost of direct search is

T_{\mathrm{dir}}[P(R), V, c] = \Big( P(R = 0)\,V + \sum_{r=1}^{\infty} P(R = r)\,\tau(1, r, V) \Big) \times c \qquad (16)

where τ(k, r, V) denotes the expected number of images that must be examined before finding k positive responses, given that r positive responses can occur in V images. A close look at the underlying parameters shows that β and P(R) are coupled: if everything else remains constant, a greater number of positive responses — a smaller value of P(R = 0) — causes the expected value of R to be higher, but it also increases β.
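
A minimal sketch of the direct-search quantities in Eqs. (15)-(16) is given below. The distribution over the number of positive responses and the closed form used for τ(1, r, V) — the expected index of the first positive when r positives are spread uniformly over V views — are simplifying assumptions made here for illustration, not details taken from [303].

    def tau(k, r, V):
        # Expected number of images examined before finding k positive responses,
        # assuming the r positives are spread uniformly over the V views (k = 1 only).
        assert k == 1
        return (V + 1) / (r + 1)

    def direct_search(p_R, beta, V, c):
        # p_R[r] = P(R = r); returns the success probability and expected cost.
        gamma_dir = (1.0 - p_R[0]) * (1.0 - beta)                               # Eq. (15)
        T_dir = (p_R[0] * V +
                 sum(p_R[r] * tau(1, r, V) for r in range(1, len(p_R)))) * c    # Eq. (16)
        return gamma_dir, T_dir

    # Hypothetical distribution over the number of positive responses.
    gamma, T = direct_search([0.4, 0.35, 0.25], beta=0.1, V=20, c=1.0)
    print(gamma, T, gamma / T)   # search efficiency gamma / T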

An indirect search model (see Fig. 35) is defined recursively by applying a direct search around the neighbourhood indicated by each indirectly detected object. The authors perform a number of tests on some simple scenes using simple object detectors. One type of test they perform, for example, is detecting plates by first detecting tables as intermediate objects. An almost 8-fold increase in detection speed is observed. These mathematical models examine the conditions under which spatial relationships between objects can provide more efficient searches. The models and experiments demonstrate that indirect search may require fewer images/foveations and increases the probability of detecting an object, by making it less likely that we will process irrelevant information. As with most early research, the work is not tested on the large datasets that more recent papers are usually tested on. Nevertheless, the results are consistent with the results presented in [24], regarding the significant speed-up of object search that is achieved if we use a purposive search strategy, as compared to random search. We should point out that this paper does not take into account the effects of various cost constraints and dead-reckoning errors. Instead, it concentrates mostly on the next-view-planner, while somewhat ignoring the possible effects of the next-view-planner's synergy with an object detector — for example, in terms of simulating attentional priming effects to speed up the search.

Figure 37: Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for recognizing man-made objects (L, T, Y, Arrow, Terminal, Three-tangent, and Curvature-L junctions).

Brunnstrom et al. [305, 306] present a set of computational strategies for choosing fixation points in a contextual and task-dependent manner. As shown in Fig. 37, a number of junction types are specified, along with a grouping strategy for these junctions; this grouping strategy depends on depth discontinuities (determined by a stereo camera), and also affects the sensor's fixation strategy (see Fig. 1). The authors present a methodology for determining the junction type present in the image, and argue that this strategy could be quite useful for recognizing an even larger variety of textureless objects.

Ye and Tsotsos [307] provide an early systematic study of the problem of sensor planning for 3D object search. The authors propose a sensor planning strategy for a robot that is equipped with a pan, tilt and zoom camera. They show that under a particular probability updating scheme, the brute-force solution to the problem of object search — maximizing the probability of detecting the target with minimal cost — is NP-Complete and, thus, they propose a heuristic strategy for solving this problem. The special case of the problem under Bayesian updating was discussed in [45, 322]. The search agent's knowledge of object location is encoded as a discrete probability density, and each sensing action is defined by a viewpoint, a viewing direction, a field of view and the application of a recognition algorithm. The most obvious approach to solving this problem is to perform a 360° pan of the scene using wide-angle camera settings and to search for the object in this whole scene. However, this might not work well if we are searching for a small object that is relatively far away, since the object might be too small to detect. The authors propose a greedy heuristic approach to solving the problem, which consists of choosing the action that maximizes the expected object detection probability divided by the expected cost of the action. Thus, within the context of the recognition pipeline in Fig. 1, this constitutes an algorithm for hypothesizing and verifying the objects present in the scene, by adjusting the viewpoint parameters with which the object is sensed.
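
The greedy step can be sketched as follows; the action set, the discrete location prior and the per-action detection probabilities are illustrative stand-ins rather than quantities defined in [307].

    def select_next_action(actions, location_prior):
        # Greedy sensor-planning step: pick the sensing action that maximizes
        # the expected gain in detection probability divided by its cost.
        # Each action = (name, cost, covered_cells, p_detect_if_present).
        def utility(action):
            name, cost, covered_cells, p_detect = action
            expected_gain = p_detect * sum(location_prior[c] for c in covered_cells)
            return expected_gain / cost
        return max(actions, key=utility)

    location_prior = {"desk": 0.5, "shelf": 0.3, "floor": 0.2}   # P(object in cell)
    actions = [("wide-angle pan", 1.0, ["desk", "shelf", "floor"], 0.3),
               ("zoom on desk", 2.0, ["desk"], 0.9)]
    print(select_next_action(actions, location_prior)[0])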

Figure 38: An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively search an indoor environment. The figure shows the robot, its coordinate frames and bounding cylinder, the Target Map cells (with an object lying in one cell), and the fusion of an image's seven Target Confidence Maps used to update the Target Map probabilities.

Minut and Mahadevan [308] present a reinforcement learning approach to next viewpoint selection using a pan-tilt-zoom camera. They use a Markov Decision Process (MDP) and the Q-learning algorithm to determine the next saccade given the current state, where states are defined as clusters of images representing the same region in the environment. A simple histogram intersection — using color information — is used to match an image I with a template M. If a match is found with a low-resolution version of the image, the camera zooms in, obtains a higher resolution image and verifies the match. If no match is found (i.e., the desired object is not found), they use the pan-tilt unit to direct the camera to the most salient region (saliency is determined by a symmetry operator defined in the paper) located within one of 8 subregions. Choosing the subregion to search within is determined by the MDP and the prior contextual knowledge it has about the room.
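
For reference, the colour-based matching step can be written as a histogram intersection score normalized by the template mass; the 8-bin hue histograms below are placeholder data.

    import numpy as np

    def histogram_intersection(h_image, h_template):
        # Intersection of two colour histograms, normalized so that a perfect
        # match of the template yields a score of 1.
        return np.minimum(h_image, h_template).sum() / h_template.sum()

    h_img = np.array([5, 0, 2, 9, 1, 0, 0, 3], dtype=float)
    h_tmpl = np.array([4, 1, 2, 8, 0, 0, 1, 4], dtype=float)
    print(histogram_intersection(h_img, h_tmpl))   # 0.85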

Kawanishi et al. [309] use multiple pan-tilt-zoom cameras to detect known objects in 3D environments. They demonstrate that with multiple cameras the object detection and localization problems can become more efficient (2.5 times faster) and more accurate than with a single camera. The system collects images under various illumination conditions, object views, and zoom rates, which are categorized as reference images for prediction (RIP) and reference images for verification (RIV). RIP images are small images that are discriminative enough for roughly predicting the existence of the object. RIV images are higher resolution images for verifying the existence of objects. For each image region where a likely object is detected using the RIP images, the cameras zoom in, pan and tilt, in order to verify whether the object is indeed located at that image region.

More recently, Ekvall et al. [310] integrated a SLAM approach with an object recognition algorithm based on receptive-field co-occurrence histograms. Other recent algorithms combine image saliency mechanisms with bags-of-features approaches [311, 312]. Saidi et al. [313] present an implementation, on a humanoid robot, of an active object localization system that uses SIFT features [72] and is based on the next-view-planner described by Ye and Tsotsos [307].

Masuzawa and Miura [314] use a robot equipped with vision and range sensors to localize objects. The range finder is used to detect free space and vision is used to detect the objects. The detection module is based on color histogram information and SIFT features. Color features are used for coarse object detection, and the SIFT features are used for verification of the candidate object's presence. Two planning strategies are proposed: one for the coarse object detection and one for the object verification. The object detection planner maximizes a utility function for the next movement, which is based on the increase in the observed area divided by the cost of making this movement. The verification planner proposes a sequence of observations that minimizes the total cost while making it possible to verify all the relevant candidate object detections. Thus, this paper makes certain proposals for improving the object hypothesis and verification module of the standard recognition pipeline (see Fig. 1) by using a utility function to choose the optimal next viewpoint.

Sjoo et al. [315] present an active search algorithm that uses a monocular camera with zoom capabilities. A robot that is equipped with a camera and a range finder is used to create an occupancy grid and a map of the relevant features present in the search environment. The search environment consists of a number of rooms. The closest unvisited room is searched next, where the constructed occupancy grid is used to guide the robot. For each room, a greedy algorithm is used to select the order in which the room's viewpoints are sensed, so that all possible object locations in the map are sensed. The algorithm uses receptive-field co-occurrence histograms to detect potential objects. If potential objects are located, the sensor's zoom settings are appropriately adjusted so that SIFT-based recognition is possible. If recognition using SIFT features is not possible, this viewpoint hypothesis is pruned (also see Fig. 1), and the process is repeated until recognition has been performed for all the possible positions in the room where an object might be located.

Figure 39: An example of ASIMO pointing at an object once the target object is successfully localized in a 3D environment [24].

Ma et al. [316] use a two-wheeled non-holonomic robot with an actuated stereo camera mounted on a pan-tilt unit, to search for 3D objects in an indoor environment. A global search based on color histograms is used to perform a coarse search, somewhat similar in spirit to the idea of indirect search by Wixson and Ballard [303] which we previously discussed. Subsequently, a more refined search (based on SIFT features and a stereo depth extraction algorithm) is used in order to determine the objects' actual position and pose. An Extended Kalman Filter is used for sustained tracking and the A* graph search is used for navigation.

Andreopoulos et al. [24] present an implementation of an online active object localization system, using an ASIMO humanoid robot developed by Honda (see Figs. 38, 39). A normalized metric is introduced for target uniqueness, both within a single image and across multiple images of the scene captured from different viewpoints. This metric provides a robust probability updating methodology. The paper makes certain proposals for building more robust active visual search systems in the presence of various errors. Imperfect disparity estimates, an imperfect recognition algorithm, and dead-reckoning errors place certain constraints on the conditions chosen for determining when the object of interest has been successfully localized. A combination of multiple-view recognition and single-view recognition approaches is used to achieve robust and real-time object search in an indoor environment. A hierarchical object recognition architecture, inspired by human vision, is used [218]. The object training is done by in-hand demonstration and the system is extensively tested on over four hundred test scenarios. The paper demonstrates the feasibility of using state-of-the-art vision-based robotic systems for efficient and reliable object localization in an indoor 3D environment. This constitutes an example of a neuromorphic vision system applied to robotics, due to (i) the use of a humanoid robot that emulates human locomotion, (ii) the use of a hierarchical feed-forward recognition system inspired by human vision, and (iii) the use of a next-view planner that shares many of the behavioural properties of the ideal searcher [323]. Within the context of the recognition pipeline in Fig. 1, this constitutes a proposal for hypothesizing and verifying the objects present in the scene (by adjusting the viewpoint parameters with which the object is sensed) and for extracting and grouping low-level features more reliably based on contextual knowledge about the relative object scale.

As previously indicated, on average, the scalability of inference for active object localization algorithms is worse than the current state of the art in passive recognition. This is partially attributable to the online requirements of active localization/recognition mechanisms, which make the construction of such real-time and online systems a significant challenge. Furthermore, powerful vision systems implemented on current popular CPU architectures are extremely expensive power-wise. This makes it difficult to achieve the much coveted mobility threshold that is often a necessary requirement of active object localization algorithms.


Figure 40: The twenty object classes of the 2011 PASCAL dataset — vehicles (car, bus, bicycle, motorcycle, airplane, boat, train), household objects (chair, sofa, dining table, TV/monitor, bottle, potted plant), animals (cat, dog, bird, cow, horse, sheep), and person. Some of the earlier versions of the PASCAL dataset only used subsets of these object classes. Adapted from [324].

4. Case Studies From Recognition Challenges and The Evolving Landscape

In this section we present a number of case studies that exemplify the main characteristics of algorithms that have been proven capable of addressing various facets of the recognition problem. Based on this exposition we also provide a brief discussion as to where the field appears to be headed.

4.1. Datasets and Evaluation Techniques

Early object recognition systems were for the most part tested on a handful of images. With the exception of industrial inspection related systems, basic research publications tended to focus on the exposition of novel recognition algorithms, with a lesser focus on actually quantifying the performance of these algorithms. More recently, however, large annotated datasets of images containing a significant number of object classes have become readily available, precipitating the use of more quantitative methodologies for evaluating recognition systems. Everingham et al. [324] overview the PASCAL challenge dataset, which is updated annually (see Fig. 40). Other popular datasets for testing the performance of object/scene classification and object localization algorithms include the Caltech-101 and Caltech-256 datasets (Fei-Fei et al. [325], Griffin et al. [326]), Flickr groups (http://www.flickr.com/groups), the TRECVID dataset (Smeaton et al. [327]), the MediaMill challenge (Snoek et al. [328]), the Lotus-Hill dataset (Yao et al. [329]), the ImageCLEF dataset (Sanderson et al. [330]), the COIL-100 dataset (Nene et al. [331]), the ETH-80 dataset (Leibe and Schiele [332]), the Xerox7 dataset (Willamowski et al. [333]), the KTH action dataset (Laptev and Lindeberg [334]), the INRIA person dataset (Dalal and Triggs [335]), the Graz dataset (Opelt et al. [336]), the LabelMe dataset (Russell et al. [337]), the TinyImages dataset (Torralba et al. [338]), the ImageNet dataset (Deng et al. [339]), and the Stanford action dataset (Yao et al. [340]). Notice that such offline datasets have almost exclusively been applied to passive recognition algorithms, since active vision systems cannot be easily tested using offline batches of data. Testing an active vision system using offline datasets would require an inordinate number of images that sample the entire search space under all possible intrinsic and extrinsic sensor and algorithm parameters. Typically, such systems are initially tested using simple simulations, followed by a significant amount of time spent field testing the system.

A number of metrics are commonly used to provide succinct descriptors of system performance. Receiver Operating Characteristic (ROC) curves are often used to visualize the true positive rate versus the false positive rate of an object detector (see Sec. 1) as a class label threshold is changed, assuming of course that the algorithm uses such a threshold (note that sometimes in the literature the false positive rate is also referred to as the false accept rate, and the false negative rate is referred to as the false reject rate). In certain cases Detection Error Tradeoff (DET) curves are used to provide a better visualization of an algorithm's performance [341], especially when small probabilities are involved. The equal error rate (EER) corresponds to the false positive value FP achieved when the corresponding ROC curve point maps to a true positive value TP that satisfies FP = 1 − TP. This metric is convenient as it provides a single value of algorithm quality (a lower EER value indicates a better detector). The area under an ROC curve is also often used as a metric of algorithm quality. The use of the average precision (AP) metric in the more recent instantiations of the PASCAL challenge has also gained acceptance [324, 342]. The average precision is defined as

\mathrm{AP} = \frac{1}{|R|} \sum_{k=1}^{|R|} c_k \qquad (17)

where R is the set of positive examples in the validation or test set,

c_k = \begin{cases} |R \cap M_k| / k & \text{if the algorithm is correct on the } k\text{th sample} \\ 0 & \text{otherwise} \end{cases} \qquad (18)

and M_k = \{i_1, \ldots, i_k\} is the list of the top k best performing test set samples. Standard tests of statistical significance (e.g., t-tests, ANOVA tests, Wilcoxon rank-sum tests, Friedman tests) are sometimes used when comparing the performance of two or more algorithms which output continuous values (e.g., comparing the percentage of overlap between the automated object localization/segmentation and the ground-truth segmentation). See [343, 344, 345] for a discussion of good strategies for annotating datasets and evaluating recognition algorithms.
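
A small sketch of the AP computation of Eqs. (17)-(18) follows, assuming the ranked test samples and the number of ground-truth positives are given; the ranking below is invented for illustration.

    def average_precision(ranked_correct, num_positives):
        # ranked_correct[k-1] is True if the k-th ranked sample is a true positive;
        # num_positives = |R|, the number of positive examples in the test set.
        ap, hits = 0.0, 0
        for k, correct in enumerate(ranked_correct, start=1):
            if correct:
                hits += 1
                ap += hits / k      # |R intersect M_k| / k, accumulated on correct samples
        return ap / num_positives

    # Hypothetical ranking of six detections with three ground-truth positives.
    print(average_precision([True, False, True, False, False, True], 3))   # ~0.72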

Our discussion on evaluation techniques for recognition algorithms would be incomplete without presenting the criticism associated with the use of such datasets. Such criticism is sometimes encountered in the literature or in conferences on vision research (see [193, 73, 194, 346] for example). In other words, the question arises as to how good indicators these datasets and their associated tests are for determining whether progress is being made in the field of object recognition. One argument is that the current state-of-the-art algorithms in object recognition identify correlations in images, and are unable to determine true causality, leading to fragile recognition systems. An example of this problem arose in early research on neural networks, where the task was to train a neural network to determine the presence or absence of a certain vehicle type in images (Geoff Hinton, personal communication). The neural network was initially capable of reliably detecting the objects of interest in the images of the original dataset. However, on a new validation dataset of images, the performance dropped drastically. On careful examination it was determined that in the original dataset, the images containing the object of interest had on average a higher intensity. During training, the neural network had learned to decide whether the object was present or absent by calculating the average image intensity and thresholding this intensity value. Evidently, in the original dataset there existed a correlation between average image intensity and object presence. However, in the new dataset this correlation was no longer present, making the recognition system unable to generalize to a situation that the human visual system is capable of addressing almost effortlessly. It has been argued that only correlation can be perceived from experience, and that determining true causality is an impossibility. In medical research the mitigation of such problems is often accomplished through the use of control groups and the inclusion of placebo groups, which allow the scientist to test the effect of a particular drug by also testing its effect under an approximation of a counter-factual state of the world. However, as experience has shown – and as is often the case in computer vision research – the results of such controlled experiments, whose conclusions ultimately rely on correlations, are often wrong. Ioannidis [347] analyses the problem, and provides a number of suggestions as to why this phenomenon occurs, which we quote below:

• The smaller the case studies, the more likely the findings are false.

• The smaller the effect sizes in a research field, the less likely the research findings are true. For example, a study of the impact of smoking on cardiovascular disease will more likely lead to correct results than an epidemiological study that targets a small minority of the population.

• The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. As a result, confirmatory designs such as large controlled trials are more likely true than the results of initial hypothesis-generating experiments.


VOC2011: Zhu et al. [348], Chen et al. [349], Song et al. [350]
VOC2010: van de Sande et al. [351], Bourdev and Malik [352], Bourdev et al. [353], Perronnin et al. [354], Chen et al. [355]
VOC2009: Vedaldi et al. [356], Wang et al. [357], Khan et al. [358, 359]
VOC2008: Harzallah et al. [360, 361], Tahir et al. [342, 362], Felzenszwalb et al. [363]
VOC2007: Perronnin and Dance [364], Zhang et al. [142], Chum and Zisserman [365], Felzenszwalb et al. [366], Ferrari et al. [367], van de Weijer and Schmid [368], Viitaniemi and Laaksonen [265]
VOC2006: Everingham et al. [369], Laaksonen et al. [263], Zhang et al. [142], Dalal and Triggs [335], Viitaniemi and Laaksonen [264]
VOC2005: Everingham et al. [370], Zhang et al. [142], Leibe et al. [199], Dalal and Triggs [335]

Figure 41: Documents describing some of the top-ranked algorithms for classifying and localizing objects in the PASCAL Visual Object Classes Challenges of 2005-2011. Note that this is not an exhaustive list of the algorithms tested in the VOC challenges: it is simply meant to provide a sample of the most "distinct" approaches that have been proven over the years to provide satisfactory results in these challenges. See [371] and [324] for an overview of the competition and a listing of all the algorithms tested over the years.

Figure 42: The pipeline used by Dalal and Triggs [335]: input image → normalize gamma and colour → compute gradients → weighted vote into spatial and orientation cells → contrast normalize over overlapping spatial blocks → collect HOGs over the detection window → linear SVM → person/non-person classification.

• The greater the flexibility in designs, definitions, outcomes and analytical models in a scientific field, the less likely the research findings are to be true. For example, flexibility increases the potential of turning "negative" results into "positive" results. Similarly, fields that use stereotyped and commonly agreed analytical methods typically result in a larger proportion of true findings.

• The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. As empirical evidence shows, "expert opinion" is extremely unreliable.

• The hotter a scientific field, with more scientific teams involved, the less likely the research findings are true.

The fact that usually only positive results supporting a particular hypothesis are submitted for publication, while negative results not supporting a particular hypothesis are often not submitted for publication, can make it more difficult to understand the limitations of many methodologies [347]. Despite these potential limitations, the hard reality is that the use of datasets currently constitutes the most reliable means of testing recognition algorithms. As Pinto et al. [193] indicate, an improvement in evaluation methodologies might entail simulating environments and testing recognition systems on these environments. But of course creating environments that are acceptable to the vision community and which are sufficiently realistic is a challenging task. As argued in [73], offline datasets are typically pre-screened for good quality in order to eliminate images with saturation effects, poor contrast, or significant noise. Thus, this pre-screening introduces an implicit bias in the imaging conditions of such datasets. In the case of active and dynamic vision systems, which typically sense an environment from a greater number of viewpoints and under more challenging imaging conditions, it becomes more difficult to predict the performance of a vision system by using exclusively such datasets.

4.2. Sampling the Current State-of-the-Art in the Recognition Literature

A survey of the object recognition literature that does not attempt to determine what the state of the art is in terms of performance would be incomplete. To this extent, we present in some detail some of the algorithms for which there is some consensus in the community that they belong to the top tier of algorithms that reliably address the object detection, localization and recognition problems (see Sec. 1). In Chart 9 and Table 7 we present a comparison, along certain dimensions, of a number of the papers that will be surveyed in Sec. 4.2. For the reasons elaborated upon earlier, determining the best performing algorithms remains a difficult problem. In the active and dynamic vision literature there does not currently exist a standardized methodology for evaluating the systems in terms of their performance and search efficiency. However, sporadically, there have been certain competitions (such as the semantic robot vision challenge) attempting to address these questions. Arguably the most popular competition for evaluating passive recognition algorithms is the annual PASCAL challenge. We thus focus our discussion in this section on presenting in some detail the general approaches taken by some of the best performing algorithms in the annual PASCAL challenge for classifying and localizing the objects present in images. In general, good performance on the PASCAL datasets is a necessary condition of a solution to the recognition problem, but it is not a sufficient condition. In other words, good performance on a dataset does not guarantee that we have found a "solution", but it can be used as a hint, or a simple guiding principle, for the construction of vision systems, which is why we focus on these datasets in this section. For each annual PASCAL challenge, we discuss some of the best performing algorithms and the reasons why the approaches from each year were able to achieve improved performance. These annual improvements are always characterized within the general setting described in Fig. 1.

Chart 9: Summary of the PASCAL Challenge papers from Table 7 which correspond to algorithms published between 2002-2011. Notice that the winning PASCAL challenge algorithms typically make little use of function, context and 3D, and make a moderate use of texture.

Figure 43: The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average gradient image over a set of registered training images. (b), (c): Each pixel demonstrates the maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test image used in the rest of the subfigures. (e): The computed R-HOG descriptor of the image in subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM weights respectively.

From Table 7 and Chart 9 we notice that the top-ranked PASCAL systems make very little use of 3D object representations. In modern work, 3D is mostly used within the context of robotics and active vision systems (see Tables 5-6). In general, image categorization/classification algorithms (which indicate whether an image contains an instance of a particular object class) are significantly more reliable than object localization algorithms, whose task is to localize (or segment) in an image all instances of the object of interest. Good localization performance has been achieved for restricted object classes: in general there still does not exist an object localization algorithm that can consistently and reliably localize arbitrary object classes. As Chum and Zisserman [365] indicate, image classification algorithms have achieved significant improvements since the early 2000s, and this is generally attributed to the advent and popularity of powerful classifiers and feature representations.

Figure 44: Examples of the Harris-Laplace detector and the Laplacian detector, which were used extensively in [142] as interest-point/region detectors (figure reproduced from [142] with permission).

4.2.1. Pascal 2005

We now briefly discuss some of the best performing approaches tested during the 2005 Pascal challenge for the image classification and object localization problems (see Fig. 41). This is not meant to be an exhaustive listing of the relevant approaches, but rather to provide a sample of some relatively successful approaches tested over the years. 2005 was the first year of the PASCAL Visual Object Challenge. One of the best performing approaches was presented by Leibe et al. [199], which we also overviewed in Sec. 2.9.

Dalal and Triggs [335] tested their Histogram of Oriented Gradients (HOG) descriptors in this challenge. In their original paper, Dalal and Triggs focused on the pedestrian localization problem, but over the years HOG-based approaches have become quite popular, and they constitute some of the most widely used descriptors in the object recognition literature. See Fig. 42 for an overview of the pipeline proposed by Dalal and Triggs. The authors' experiments suggest that the other best-performing keypoint-based approaches have false positive rates that are at least 1-2 orders of magnitude greater than their HOG dense-grid approach for human detection. As the authors indicate, the fine orientation sampling and the strong photometric normalization used by their approach constitute the best strategy for improving the performance of pedestrian detectors, because they allow limbs and body segments to change their position and appearance (see Fig. 43). The authors evaluated numerous pixel colour representations, such as greyscale, RGB and LAB colour spaces, with and without gamma equalization. The authors also tested various approaches for evaluating gradients, and based on their results the simplest scheme, which relied on point derivatives with Gaussian smoothing, gave the best results. The main constituent of the HOG representation is the orientation binning with normalization that is applied to the various descriptor blocks/cells. The cells tested are both rectangular and radial. Orientation votes/histograms are accumulated in each one of those cells. The orientation bins tested are both unsigned (0-180 degrees) and signed (0-360 degrees). The authors choose to use 9 orientation bins, since more bins lead to marginal improvements at best. Furthermore, the authors note that the use of signed orientations decreases performance. The authors also tested various normalization schemes, which mostly entail dividing the cell histograms by the orientation "energy" present in a local neighborhood. The above-described combinations for constructing histograms of orientation were then used in conjunction with linear and non-linear SVMs, achieving state-of-the-art performance for pedestrian detection. Note, however, that the system was tested on images where the size of the pedestrians' projection on the image was significant. A final observation that the authors make is that any significant amount of smoothing before gradient calculation degrades the system performance, demonstrating that the most important discriminative information comes from sudden changes in the image at fine scales.
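
A minimal sketch of this kind of pipeline is shown below, using scikit-image's HOG implementation and a linear SVM from scikit-learn; the parameter values simply mirror the 9-bin, unsigned-orientation configuration described above, and the random arrays stand in for pedestrian and background windows.

    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    def hog_descriptor(window):
        # 9 unsigned orientation bins, 8x8-pixel cells, 2x2-cell blocks with
        # L2-Hys contrast normalization, roughly the configuration reported
        # as best-performing by Dalal and Triggs.
        return hog(window, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), block_norm='L2-Hys')

    rng = np.random.default_rng(0)
    windows = rng.random((20, 128, 64))        # placeholder 128x64 detection windows
    labels = np.repeat([1, 0], 10)             # 1 = pedestrian, 0 = background

    X = np.array([hog_descriptor(w) for w in windows])
    clf = LinearSVC(C=0.01).fit(X, labels)
    print(clf.decision_function(X[:3]))        # window scores; threshold to detect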

Zhang et al. [142] discuss a number of local-image-feature extraction techniques for texture and object category classification. In conjunction with powerful discriminative classifiers, these approaches have led to top-tier performance in the VOC2005, VOC2006 and VOC2007 competitions. Their work is mostly focused on the problem of classifying an image as containing an instance of a particular object, and is not as much focused on the object localization problem. As we discussed earlier, and as we will discuss in more detail later in this section, a good classifier does not necessarily lead to a good solution to the object localization problem. This is due to the fact that simple brute-force sliding-window approaches to the object localization problem are extremely slow, due to the need to enumerate all possible positions, scales, and aspect ratios of a bounding-box for the object position.

Figure 45: The distributions of various object classes corresponding to six feature classes. These results were generated by the self-organizing-map algorithm used in the PicSOM framework [263]. Darker map regions represent SOM areas where images of the respective object class have been densely mapped based on the respective feature (from [263] with permission).

As Zhang et al. [142] indicate, in the texture recognition problem local features play the role of frequently repeated elements, while in the object recognition problem these local features play the role of "words" which are often powerful predictors of a certain object class. The authors show that using a combination of multiple interest-point detectors and descriptors usually achieves much better results than the use of a single interest-point detector/descriptor pair. They also reach the conclusion that using local features/descriptors with the highest possible degree of invariance does not necessarily lead to the optimal performance. As a result, they suggest that when designing recognition algorithms, only the minimum necessary degree of feature invariance should be used. The authors note that many popular approaches make use of both foreground and background features. They argue that the use of background features could often be seen as a means of providing contextual information for recognition. However, as the authors discover during their evaluation, such background features tend to help when dealing with "easy" datasets, while for more challenging datasets, the use of both foreground and background features does not improve the recognition performance.

Figure 46: Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse template representation and the higher resolution subpart templates of the person (from [366] with permission).

Zhang et al. [142] use affine-invariant versions of two interest point detectors: the Harris-Laplace detector [102], which responds to corners, and the Laplacian detector [372], which responds to blob regions (see Fig. 44). These elliptical regions are normalized into circular regions from which descriptors are subsequently extracted. The authors also test these interest-point detectors using scale invariance only, using scale with rotation invariance, and using affine invariance. As descriptors, the authors investigated the use of SIFT, SPIN, and RIFT descriptors [373, 374]. The SIFT descriptor was discussed in Sec. 2.9. The SPIN descriptor is a two-dimensional rotation-invariant histogram of intensities in a neighborhood surrounding an interest point, where each histogram cell (d, i) corresponds to the distance d from the center of the region and the weight of intensity value i at that distance. The RIFT descriptor is similar to SIFT and SPIN, where rotation-invariant histograms of orientation are created for a number of concentric circular regions centered at each interest point. The descriptors are made invariant to affine changes in illumination by assuming pixel intensity transformations of the form aI(x) + b at pixel x, and by normalizing those regions with respect to the mean and standard deviation. The authors use various combinations of interest-point detectors, descriptors and classifiers to determine the best performing combination. Given training and test images, the authors create a more compact representation of the extracted image features by clustering the descriptors in each image to discover its signature (p_1, u_1), ..., (p_m, u_m), where m is the number of clusters discovered by a clustering algorithm, p_i is the cluster's center and u_i is the fraction of image descriptors present in that cluster. The authors discover that signatures of length 20-40 tend to provide the best results. The Earth Mover's Distance (EMD) [375] is used to define a "distance" D(S_1, S_2) between two signatures S_1, S_2. The authors also consider the use of mined "vocabularies/words" from training sets of images, corresponding to "clusters" of common features. Two histograms S_1 = (u_1, ..., u_m), S_2 = (w_1, ..., w_m) of such "words" can be compared to determine if a given image belongs to a particular object class. The authors use the χ2 distance to compare two such histograms:

D(S_1, S_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{(u_i - w_i)^2}{u_i + w_i} \qquad (19)

Image classification is tested on SVMs with linear, quadratic, Radial-Basis-Function, χ2 and EMD kernels, where the χ2 and EMD kernels are given by

K(S_i, S_j) = \exp\Big(-\frac{1}{A} D(S_i, S_j)\Big) \qquad (20)

where D(·, ·) can represent the EMD or χ2 distance and A is a normalization constant. The bias term of the SVM decision function is varied to obtain ROC curves of the various tests performed. The system is evaluated on texture and object datasets. As we have already indicated, the authors discover that greater affine invariance does not necessarily help improve the system performance. The Laplacian detector tends to extract four to five times more regions per image than the Harris-Laplace detector, leading to better performance in the image categorization task; overall, a combination of Harris-Laplace and Laplacian detectors with SIFT and SPIN descriptors performs best. Both the EMD and χ2 kernels seem to provide good and comparable performance. Furthermore, the authors notice that randomly varying/shuffling the backgrounds during training results in more robust classifiers.
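
A short sketch of the χ2 distance of Eq. (19) and the kernel of Eq. (20); the histograms are placeholders, and setting the normalization constant A to a statistic such as the mean training-set distance is an assumption made here rather than a detail quoted from [142].

    import numpy as np

    def chi2_distance(u, w):
        # Chi-square distance between two word histograms, Eq. (19).
        u, w = np.asarray(u, dtype=float), np.asarray(w, dtype=float)
        denom = u + w
        mask = denom > 0                       # avoid 0/0 on empty bins
        return 0.5 * np.sum((u[mask] - w[mask]) ** 2 / denom[mask])

    def chi2_kernel(s_i, s_j, A=1.0):
        # Kernel of Eq. (20); A normalizes the distances.
        return np.exp(-chi2_distance(s_i, s_j) / A)

    h1, h2 = [10, 0, 5, 3], [8, 2, 4, 4]
    print(chi2_distance(h1, h2), chi2_kernel(h1, h2, A=1.0))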

Within the context of Fig. 1 (i.e., the feature-extraction → feature-grouping → object-hypotheses → object-verification → object-recognition pipeline), we see that the best performing systems of PASCAL 2005 demonstrate how careful pre-processing during the low-level feature extraction phase makes a significant difference in system reliability. Small issues, such as the number of orientation bins, the number of scales, or whether to normalize the respective histograms, make a significant difference in system performance. This demonstrates the importance of carefully studying the feature-processing strategies adopted by the winning systems. One could argue that vision systems should not be as sensitive to these parameters. However, the fact remains that current state-of-the-art systems have not reached the level of maturity that would make them robust against such variations in the low-level parameters. Another observation with respect to Fig. 1 is that the object representations of the winning systems in PASCAL 2005 were for the most part "flat" and made little use of the object hierarchies whose importance we have emphasized in this survey. As we will see, in more recent work, winning systems have made greater use of such hierarchies. Finally, while one could argue that Leibe et al. [199] made use of a generative object hypothesis and verification phase, in general, the winning algorithms of PASCAL 2005 were discriminative and did not make use of sophisticated modules for implementing the full pipeline of Fig. 1.

Figure 47: The HOG feature pyramid used in [366], showing the coarse root-level template and the higher resolution templates of the person's subparts (from [366] with permission).

4.2.2. Pascal 2006

In addition to the previously described methodologies, a combination of the approaches described in [263], [264] proved successful for many of the object classes tested in VOC2006 (see Fig. 41). The presented algorithm ([263], [264]) is used both for the VOC challenge's image classification task and for the object localization task. In testing their algorithm for the object localization task, the authors consider an object as successfully localized if a_0 > 0.5, where

a_0 = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \qquad (21)

and B_{gt}, B_p denote the ground truth and localized image regions. The object classification and localization system tested relies to a large extent on the PicSOM framework for creating self-organizing maps (see Fig. 45). The authors in [264] take advantage of the topology preserving nature of the SOM mapping to achieve an image's classification by determining the distance of the image's representation on the grid from positive and negative examples of the respective object class hypothesis. For the classification task a greedy sequential forward search is performed to enlarge the set of features used in determining the distance metric, until the classification performance stops increasing on the test dataset. The feature descriptors used include many of the descriptors used in the MPEG-7 standard as well as some non-standard descriptors. The authors experimented with numerous color descriptors. These include, for example, color histograms in HSV and HMMD color spaces and their moments, as well as color layout descriptors, where the image is split into non-overlapping blocks and the dominant colors in YCbCr space are determined for each block (the corresponding discrete cosine transform coefficients are used as the final descriptors). Furthermore, Fourier descriptors of segment contours are used as features, as well as histograms and co-occurrence matrices of Sobel edge directions. The object localization algorithm relies to a large extent on the use of a simple greedy hierarchical segmentation algorithm that merges regions with high similarity. These regions are provided as input to the classifier, which in turn enables the object localization.
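
The overlap criterion of Eq. (21) for axis-aligned bounding boxes can be sketched as follows; the two example boxes are arbitrary.

    def overlap_score(box_p, box_gt):
        # a0 of Eq. (21) for boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
        ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(box_p) + area(box_gt) - inter
        return inter / union if union > 0 else 0.0

    score = overlap_score((10, 10, 60, 60), (15, 15, 65, 65))
    print(score, score > 0.5)    # a detection counts as correct when a0 > 0.5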

Thus, within the context of Fig. 1 we see that during PASCAL 2006, and as compared to PASCAL 2005, one of the winning systems evolved by making use of a significantly greater number of low-level features. Furthermore, the use of a self-organizing map by Viitaniemi and Laaksonen [264] demonstrated that the proper grouping and representation of these features plays a conspicuous role in the best performing algorithms.

79

Page 80: 50 Years of Object Recognition: Directions Forward...50 Years of Object Recognition: Directions Forward Alexander Andreopoulos , John K. Tsotsos IBM Research - Almaden 650 Harry Road

4.2.3. Pascal 2007

During the 2007 challenge, the work by Felzenszwalb et al. [366] was tested on a number of object localization challenges. The algorithm's ability to localize various object classes was further demonstrated in subsequent years' competitions, where it consistently achieved good performance for various object classes (see Fig. 46). The authors achieved a two-fold improvement in the person detection task (as compared to the best performing person detection algorithm from the 2006 Pascal challenge) and for many object classes it outperformed the best results from the 2007 challenge. As the authors point out, there appears to be a performance gap between parts-based methods and rigid-template or bags-of-features types of representations. The authors point out that a strong point of their paper is the demonstration that parts-based methods are capable of bridging this performance gap. The system is based on shifting a scanning window over the input image in order to fit the target object representation to the input image. The object representation consists of a root and a single level of subparts. A deformation cost is defined for the subpart windows' deformations/positions with respect to the root window position (see Fig. 47). The score of a placement of the object representation is the sum of the scores of all the windows. A latent variable SVM is used during the training process, where the latent variable is used to learn a set of filter parameters (F_0, F_1, ..., F_n) and deformation parameters (a_1, b_1, ..., a_n, b_n). For each input image and any subpart deformation z, a vector of HOG features (H) and subpart displacements ψ(H, z) is extracted. The score of positioning an object representation on an image using arrangement z is given by the dot product β · ψ(H, z), where β = (F_0, F_1, ..., F_n, a_1, b_1, ..., a_n, b_n).

In more detail, the authors define a latent SVM as

f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \psi(x, z) \qquad (22)

where β · ψ(x, z) is the score of positioning the object representation according to deformation z, and Z(x) denotes all possible deformations of the object representation. Given a training dataset D = (⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩) (where x_i denotes the ith HOG pyramid vector and y_i ∈ {−1, 1} denotes a label), the authors attempt to find the optimal vector β*(D), which is defined as

\beta^*(D) = \arg\min_{\beta} \; \lambda \|\beta\|^2 + \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)) \qquad (23)

Notice, however, that due to the existence of positively labelled examples (y_i = 1), this is not a convex optimization problem. As a result the authors execute the following loop a number of times: (i) keep β fixed and find the optimal latent variable z_i for each positive example; (ii) then, holding the latent variables of the positive examples constant, optimize β by solving the corresponding convex problem. The authors try to ignore the "easy" negative training examples, since these examples are not necessary to achieve good performance. During the initial stage of the training process, a simple SVM is trained for only the root filter. The optimal position of this filter is then discovered in each training image. Since the training data only contains a bounding box of the entire object and does not specify the subpart positions, during training the subparts are initialized by finding high-energy subsets of the root filter's bounding box. This results in a new training dataset that specifies object subpart positions. This dataset is iteratively solved using the methodologies above in order to find the filter representations for the entire object and its subparts. The authors decide to use six subparts since this leads to the best performance.
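
A heavily simplified sketch of this alternating optimization is given below. Each example is represented by a small set of pre-computed candidate placement vectors ψ(x, z); unlike [366], the latent variables of the negatives are also re-fixed at every iteration and no hard-negative mining or part initialization is performed, so this only illustrates the coordinate-descent idea.

    import numpy as np
    from sklearn.svm import LinearSVC

    def best_placement(beta, placements):
        # Step (i): pick the placement z maximizing beta . psi(x, z).
        return placements[int(np.argmax(placements @ beta))]

    def train_latent_svm(pos, neg, n_iters=5, C=1.0):
        # pos/neg: lists of (num_placements, dim) arrays, one row per candidate z.
        dim = pos[0].shape[1]
        beta = np.zeros(dim)
        for _ in range(n_iters):
            X = np.array([best_placement(beta, p) for p in pos + neg])
            y = np.array([1] * len(pos) + [-1] * len(neg))
            # Step (ii): with the placements fixed, solve the (convex) linear SVM.
            beta = LinearSVC(C=C, fit_intercept=False).fit(X, y).coef_.ravel()
        return beta

    rng = np.random.default_rng(0)
    pos = [rng.normal(1.0, 1.0, (4, 6)) for _ in range(15)]    # toy positives
    neg = [rng.normal(-1.0, 1.0, (4, 6)) for _ in range(15)]   # toy negatives
    print(train_latent_svm(pos, neg)[:3])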

Figure 48: The distribution of edges and appearance patches of certain car model training images used by Chum and Zisserman [365], with the learned regions of interest overlaid (from [365], with permission).

Perronnin and Dance [364] use the Fisher kernel for image categorization. The authors extract a gradient vector from a generative probability model of the extracted image features (local SIFT and RGB statistics). These gradient vectors are then used in a discriminative classifier. An SVM and a logistic regression classifier with a Laplacian prior are tested, and both perform similarly. The authors indicate that historically, even on databases containing very few object classes, the best performance is achieved when using large vocabularies with hundreds or thousands of visual words. However, such high-dimensional histogram computations can have a high associated computational cost. Often the vocabularies extracted from a training image dataset are not universal, since they tend to be tailored to the particular object categories being learnt. The authors indicate that an important goal in vision research is to discover truly universal vocabularies, as we already discussed in Sec. 2. However, the lack of significant progress on this problem has caused some researchers to abandon this idea. In more detail, given a set of visual words X = {x_1, x_2, ..., x_T} extracted from an image, a probability distribution function p(X|λ) with parameters λ is calculated. In practice, this pdf is modelled as a Gaussian Mixture Model. Given the Fisher information matrix

F_\lambda = E_X[\nabla_\lambda \log p(X|\lambda) \, \nabla_\lambda \log p(X|\lambda)^T] \qquad (24)

the authors obtain the corresponding normalized gradient vectors F_\lambda^{-1/2} \nabla_\lambda \log p(X|\lambda). The authors derive analytical expressions for these gradients with respect to the mean, variance and weight associated with each one of the Gaussians in the mixture that models this probability. These gradients were used to train powerful classifiers, which provided state-of-the-art image classification performance on the Pascal datasets.
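
A simplified sketch of the gradient statistics involved is given below: it computes only the gradient of the GMM log-likelihood with respect to the component means, whereas the full representation in [364] also uses the variance and mixture-weight gradients and applies the F_λ^{-1/2} whitening; the GMM parameters and descriptors are random placeholders.

    import numpy as np

    def mean_gradients(X, weights, means, sigmas):
        # Gradient of log p(X|lambda) w.r.t. the means of a diagonal-covariance GMM.
        K = means.shape[0]
        # Posterior responsibilities gamma_t(k) for each descriptor x_t.
        log_p = np.stack([np.log(weights[k])
                          - 0.5 * np.sum(((X - means[k]) / sigmas[k]) ** 2
                                         + np.log(2 * np.pi * sigmas[k] ** 2), axis=1)
                          for k in range(K)], axis=1)
        gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        grads = [(gamma[:, [k]] * (X - means[k]) / sigmas[k] ** 2).sum(axis=0)
                 for k in range(K)]
        return np.concatenate(grads)           # [364] additionally whitens these vectors

    rng = np.random.default_rng(0)
    descriptors = rng.normal(size=(200, 8))    # e.g., projected local SIFT statistics
    w = np.array([0.5, 0.5])
    mu = rng.normal(size=(2, 8))
    sd = np.ones((2, 8))
    print(mean_gradients(descriptors, w, mu, sd).shape)   # (16,)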

Viitaniemi and Laaksonen [265] overview a general approach for image classification, object localization, and object segmentation. The methodology relies on the fusion of multiple classifiers. The authors report the slightly counter-intuitive observation that while their approach provides the best performing segmentation results, and some of the best image classification results, the approach is unable to provide the best object localization results.

van de Weijer and Schmid [368] expand local feature descriptors by appending photometric invariant color descriptors to the respective feature vectors. These descriptors were tested during the 2007 Pascal competition. The authors survey some popular photometric invariants and test the effects they have on recognition performance. It is demonstrated that for images where color is a highly discriminative feature, such color invariants can be quite useful. However, there is no single color descriptor that consistently gives good results. In other words, the optimal color descriptor to use is application dependent.

Chum and Zisserman [365] introduced a model for learning and generating a region of interest around instances of the object, given labelled and unsegmented training images. The algorithm achieves good localization performance in the various PASCAL challenges it was tested on. In other words, the algorithm is given as input only images of the object class in question, with no further information on the position, scale or orientation of the object in the image. From this data, an object representation is learnt that is used to localize instances of the object of interest. Given an input or training set of images, a hierarchical spatial pyramidal histogram of edges is created. Also, a set of highly discriminative "words" is learned from a set of mined appearance patches (see Fig. 48). A cost function that is the sum of the distances between all pairs of training examples is used to automatically learn the object position from an input image. The cost function takes into account the distances between the discriminative words and the edge histograms. A similar procedure, with a number of heuristics, is used to measure the similarity between two images and localize any instances of the target object in an image.

Figure 49: The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).

Ferrari et al. [367] present a family of translation- and scale-invariant feature descriptors composed of chains of k connected, approximately straight contours, referred to as kAS. See Fig. 49 for examples of kAS for k = 2. It is shown that kAS of intermediate complexity have significant repeatability and provide a simple framework for simulating certain perceptual grouping characteristics of the human visual system (see Fig. 7). The authors show that kAS substantially outperform interest points for detecting shape-based classes. Given a vocabulary of kAS, an input image is split into cells, and a histogram of the kAS present in each cell is calculated. An SVM is then used to classify the object present in an image, using a multiscale sliding window approach to extract the respective SVM input vector that is to be classified. Given an input image, the edges are calculated using the Berkeley edge detector, which takes into consideration texture and color cues (in addition to brightness) when determining the edges present in an image. Two extracted edges are "connected" if they are only separated by a small gap or if they form a junction. This results in a graph structure. For each edge, a depth-first search is performed, in conjunction with the elimination of equivalent paths, in order to mine candidate kAS. A simple clustering algorithm is used in order to mine clusters of kAS and a characteristic "word"/kAS for each cluster. In other words, each kAS is an ordered list P = (s_1, s_2, ..., s_k) of edges. For each kAS a "root" edge s_1 is determined, and a vector r_i = (r^x_i, r^y_i) of the distance from the midpoint of s_1 to s_i is determined. Similarly, an orientation θ_i and a length l_i are determined for each s_i. Thus, the measure used to determine the similarity D(a, b) between two kAS P_a, P_b is given by

D(a, b) = w_r \sum_{i=2}^{k} \|r_i^a - r_i^b\| + w_\theta \sum_{i=1}^{k} D_\theta(\theta_i^a, \theta_i^b) + \sum_{i=1}^{k} |\log(l_i^a / l_i^b)| \qquad (25)

where D_\theta(\theta_i^a, \theta_i^b) \in [0, \pi/2] is the difference between the orientations of the corresponding segments in kAS 'a' and 'b'. As with many algorithms in the literature, the algorithm focuses on building a detector for a single viewpoint. An interesting observation of the authors is that as the resolution of the tiles/cells used to split an input image increases, the spatial localization ability of kAS grows stronger, thus accommodating less spatial variability in the object class. This implies that there exists an optimal number of cells, suggesting a trade-off between optimal localization and tolerance to intra-class variation. The authors also observe that as k increases, the optimal number of cells into which the image is split has to decrease. Notice that this behaviour on the part of recognition algorithms is predicted in [45] and in [26], where the influence of object class complexity, sensor noise, scene complexity and various time constraints on the capabilities of recognition algorithms is examined rigorously, thus proving that these factors place certain fundamental limits on what one can expect from recognition systems in terms of their reliability. The authors of [367] conclude their paper by comparing their object localization algorithm to the algorithm by Dalal and Triggs [335], and demonstrating that their algorithm performs favourably.

Compared to the Pascal competitions from previous years, a push towards the use of more complex hierarchies is evident in Pascal 2007, and the use of these hierarchies resulted in improved performance. Although many researchers believe that the search for truly universal words/part-based representations has so far failed, this research indicates that for class-specific datasets such representations can be of help. Within the context of Fig. 1, these hierarchies represent a more complex type of feature grouping. Effectively, the authors are using similar low-level features (e.g., edges, color) and grouping them in more complex ways in order to achieve more universal representations of object parts.



Figure 50: It is easier to understand the left image's contents (e.g., a busy road with mountains in the background) if the cars in the image have first been localized. Conversely, in the right image, occlusions make the object localization problem difficult; thus, prior knowledge that the image contains exclusively cars can make the localization problem easier (from [361] with permission).

Figure 51: Demonstrating how top-down category-specific attentional biases can modulate the shape-words during the bag-of-words histogram construction (from [358] with permission).

In terms of object verification and object hypothesizing (see Fig. 1), the work by Felzenszwalb et al. [366] represents the most successful approach tested in Pascal 2007 for using a coarse generative model of object parts to improve recognition performance.

4.2.4. Pascal 2008

Harzallah et al. [360, 361] present a framework in which the outputs of object localization and classification algorithms are combined to improve each other's results. For example, knowing the type of image can help improve the localization of certain objects (see Fig. 50). Motivated by the cascade of classifiers proposed by Viola and Jones [227, 235, 236] (see Sec. 2.11), the authors propose a low-computational-cost linear SVM classifier for pre-selection of regions, followed by a costly but more reliable non-linear SVM (based on a χ2 kernel) for scoring the final localization output, providing a good trade-off between speed and accuracy. A winning image classifier from VOC 2007 is used for the image classification algorithm. Objects are represented using a combination of shape and appearance descriptors. Shape descriptors consist of HOG descriptors calculated over 40 and 350 overlapping or non-overlapping tiles (the authors compare various approaches for splitting the image into tiles). The appearance descriptors are built using SIFT features that are quantized into "words" and calculated over multiple scales. These words are used to construct visual word histograms summarizing the content of each one of the tiles. The authors note that overlapping square tiles seem to give the best performance. The number of positive training set examples used by the linear SVM is artificially increased, and a procedure for retaining only the hard negative examples during training is presented. The final image classification and localization probabilities are combined via simple multiplication, to obtain the probability of having an object in an image given the window's score (localization) and the image's score (classification). Various results presented by the authors show that this combination in general improves the localization and classification results for both VOC 2007 and VOC 2008.
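The following is a minimal sketch of the two-stage idea described above: a cheap linear SVM filters candidate windows, a χ2-kernel SVM rescores the survivors, and the window score is fused with the whole-image classification score by multiplication. All function names, the calibration step, and the thresholds are our own illustrative assumptions rather than the exact design of [361].

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC

def chi2_kernel(A, B, gamma=0.5):
    """Exponential chi-square kernel between histogram rows of A and B."""
    D = np.zeros((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):
        D[i] = ((a - B) ** 2 / (a + B + 1e-10)).sum(axis=1)
    return np.exp(-gamma * D)

def detect(window_hists, image_score, lin_svm, chi2_svm, train_hists, keep=100):
    """window_hists: (n, d) descriptors of candidate sliding windows.
    image_score: output of the whole-image classifier for this class.
    lin_svm:  a fitted LinearSVC used for cheap pre-selection.
    chi2_svm: a fitted SVC(kernel='precomputed'), trained on
              chi2_kernel(train_hists, train_hists).
    Returns the indices of the surviving windows and their fused scores."""
    # Stage 1: cheap linear pre-selection keeps only the most promising windows.
    coarse = lin_svm.decision_function(window_hists)
    idx = np.argsort(-coarse)[:keep]

    # Stage 2: costly but more reliable chi-square kernel SVM rescoring.
    fine = chi2_svm.decision_function(chi2_kernel(window_hists[idx], train_hists))
    window_prob = 1.0 / (1.0 + np.exp(-fine))   # crude sigmoid calibration

    # Fuse localization and classification evidence by simple multiplication.
    return idx, window_prob * image_score
```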



Papers (2002-2011)             | Inference Scalability | Search Efficiency | Training Efficiency | Encoding Scalability | Diversity of Indexing Primitives | Uses Function or Context | Uses 3D | Uses Texture
Zhang et al. [142]             | **   | *    | **   | **   | **   | *    | *    | ***
Dalal and Triggs [335]         | **   | *    | **   | **   | **   | *    | *    | **
Leibe et al. [199]             | **   | **   | **   | ***  | **   | *    | *    | *
Laaksonen et al. [263]         | **   | **   | **   | **   | ***  | *    | *    | **
Perronnin and Dance [364]      | **   | *    | **   | **   | **   | *    | *    | **
Chum and Zisserman [365]       | ***  | ***  | **   | ***  | **   | *    | *    | **
Felzenszwalb et al. [366]      | ***  | ***  | **   | **   | **   | *    | *    | **
Ferrari et al. [367]           | ***  | **   | **   | ***  | ***  | *    | *    | *
van de Weijer and Schmid [368] | **   | *    | **   | **   | ***  | *    | *    | *
Viitaniemi and Laaksonen [265] | **   | *    | *    | *    | ***  | *    | *    | **
Harzallah et al. [361]         | ***  | **   | **   | **   | ***  | ***  | *    | **
Tahir et al. [342]             | **   | **   | ***  | **   | ***  | *    | *    | **
Felzenszwalb et al. [363]      | ***  | ***  | ***  | **   | ***  | ***  | *    | **
Vedaldi et al. [356]           | ***  | *    | *    | **   | ***  | **   | *    | **
Wang et al. [357]              | **   | ***  | ***  | ***  | ***  | *    | *    | *
Khan et al. [358]              | ***  | **   | **   | **   | **   | *    | *    | *
van de Sande et al. [351]      | **   | **** | **   | **   | ***  | *    | *    | **
Bourdev and Malik [352]        | ***  | **   | **   | ***  | **   | *    | ***  | **
Perronnin et al. [354]         | ***  | **   | **** | **   | **   | *    | *    | **
Zhu et al. [348]               | ***  | **   | ***  | ***  | **   | *    | *    | *
Chen et al. [349]              | ***  | **   | ***  | ***  | **   | *    | *    | **
Song et al. [350]              | ***  | **   | ***  | **   | **** | ***  | *    | ***

Table 7: Comparing some of the more distinct algorithms of Sec. 4.2 along a number of dimensions. For each paper, and where applicable, 1-4 stars (*, **, ***, ****) are used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable label (N/A) is used. Inference Scalability: the focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: the use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: the level of automation in the training process, and the speed with which the training is done. Encoding Scalability: the encoding length of the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of Indexing Primitives: the distinctiveness and number of indexing primitives used. Uses Function or Context: the degree to which function and context influence the algorithm. Uses 3D: the degree to which depth/range/3D information is used by the algorithm for inference or model representations. Uses Texture: the degree to which texture-discriminating features are used by the algorithm.

Tahir et al. [362, 342] propose the use of Spectral Regression combined with Kernel Discriminant Analysis (SR-KDA) for classifying images in a particular class. The authors show that this classifier is appropriate for large-scale visual category recognition, since its training is much faster than that of the SVM-based approaches they tested, while at the same time achieving at least as good performance as SVMs. This makes SR-KDA a straightforward replacement for the SVM modules often used in the literature. The image representation is based on classical interest point detection, combined with various extensions of the SIFT descriptor and a visual codebook extraction phase. The algorithm achieves top-ranked performance on the PASCAL VOC 2008 and Mediamill challenges. Within the context of Fig. 1, the main innovation evident in the top-ranked algorithms of Pascal 2008 lies in their use of more powerful discriminative classifiers, which enabled an improvement of the object verification modules.

4.2.5. Pascal 2009

Felzenszwalb et al. [363] present an extension of their previous work [366]. In contrast to their earlier work, they now use stochastic gradient descent to perform the latent SVM training. Furthermore, they investigate the use of PCA-based dimensionality reduction techniques to transform the object representation vectors and obtain lower-dimensional vectors for representing the image cells. They also introduce the use of contextual knowledge to improve object localization performance. They achieve this by obtaining the set of localizations from k detections, constructing a related "context" vector from these scores, and then using this vector in conjunction with a quadratic-kernel-based SVM to rescore the images.



The authors test their algorithm on various PASCAL challenge datasets, achieving comparatively excellent performance.
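A minimal sketch of the context-rescoring step described above follows, under our own naming conventions and a simplified feature design: each detection is augmented with the per-class maximum detection scores in the image, and a quadratic-kernel SVM re-scores it. This illustrates the idea rather than reproducing the exact features of [363].

```python
import numpy as np
from sklearn.svm import SVC

def build_context_vectors(det_scores, det_classes, n_classes):
    """det_scores: (n,) raw detector scores of the detections in one image.
    det_classes: (n,) class index of each detection.
    Returns an (n, 1 + n_classes) matrix: each detection's own score plus the
    maximum score of every class in the image (the image-level 'context')."""
    context = np.zeros(n_classes)
    for s, c in zip(det_scores, det_classes):
        context[c] = max(context[c], s)
    return np.hstack([det_scores[:, None],
                      np.tile(context, (len(det_scores), 1))])

# Training (labels: 1 if a detection is a true positive, 0 otherwise), assuming
# `training_images` yields (scores, classes) pairs and K is the number of classes:
# X = np.vstack([build_context_vectors(s, c, K) for s, c in training_images])
# rescorer = SVC(kernel='poly', degree=2).fit(X, labels)
# At test time, rescorer.decision_function(...) replaces the raw detector score.
```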

Vedaldi et al. [356] investigate the use of a combination of kernels, where each kernel corresponds to a different feature channel (such as bag of visual words, dense words, histograms of oriented edges and self-similarity features). The use of combinations of multiple kernels results in excellent performance, demonstrating that further research on kernel methods has a high likelihood of further improving the performance of vision systems. Similarly to the work in [360, 361], the authors use a cascade of progressively more costly but more accurate kernels (linear, quasi-linear and non-linear kernels) to efficiently localize the objects. However, as the authors note, further work could be done to reduce the computational complexity of the framework. This algorithm also achieves comparatively excellent results on the PASCAL datasets it was tested on.
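The following is a minimal sketch of combining per-channel kernels into a single Gram matrix for an SVM with a precomputed kernel. The equal channel weights are an assumption made for illustration; multiple-kernel-learning methods would instead learn them, and the kernel choices shown are not those of [356].

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(feature_sets, kernels, weights=None):
    """feature_sets: list of (n, d_c) arrays, one per feature channel.
    kernels: list of callables k(A, B) -> (n, m) Gram matrix, one per channel.
    Returns the weighted sum of the per-channel Gram matrices."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    return sum(w * k(X, X) for w, k, X in zip(weights, kernels, feature_sets))

# Example with two illustrative channels (a linear and an RBF kernel):
# lin = lambda A, B: A @ B.T
# rbf = lambda A, B: np.exp(-np.square(A[:, None] - B[None]).sum(-1))
# K = combined_kernel([bow_hists, hog_feats], [lin, rbf])
# clf = SVC(kernel='precomputed').fit(K, labels)
```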

Similarly, Wang et al. [357] present the Locality-constrained Linear Coding (LLC) approach for obtaining sparse representations of scenes. These sparse bases are obtained through the projection of the data onto various local coordinate frames. Linear weight combinations of these bases are used to reconstruct local descriptors. The authors also propose a fast approximation to LLC which speeds up the LLC computations significantly. An SVM is used to classify the resulting image descriptors, achieving top-ranked performance when tested on various benchmarks.
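A minimal sketch of the fast LLC approximation mentioned above: for each descriptor, the k nearest codebook atoms are selected, a small least-squares system gives the reconstruction weights (constrained to sum to one), and the weights populate a sparse code. The regularization constant and the choice of k are assumptions here.

```python
import numpy as np

def llc_encode(x, codebook, k=5, reg=1e-4):
    """Approximated LLC coding of a single local descriptor.

    x:        (d,) local descriptor
    codebook: (M, d) visual codebook
    Returns an (M,) sparse code whose k non-zero weights reconstruct x from its
    k nearest codebook atoms and sum to one (the locality constraint).
    """
    # 1. Locality: keep only the k nearest codebook atoms.
    dists = np.linalg.norm(codebook - x, axis=1)
    nn = np.argsort(dists)[:k]
    B = codebook[nn]                       # (k, d) local basis

    # 2. Solve the small constrained least-squares problem
    #    min_w ||x - w^T B||^2  s.t.  sum(w) = 1
    #    via the shifted covariance system C w = 1.
    z = B - x                              # basis shifted to the descriptor
    C = z @ z.T
    C += reg * np.trace(C) * np.eye(k)     # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # enforce the sum-to-one constraint

    code = np.zeros(codebook.shape[0])
    code[nn] = w
    return code
```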

Khan et al. [358, 359] attempt to combine the bottom-up bag-of-words paradigms, which have been quite successful in the PASCAL challenges, with a top-down attention mechanism that can selectively bias the features extracted in an image based on their dominant color (see Fig. 51). As the authors point out, the two main approaches for fusing color and shape information into a bag-of-words representation are early fusion (where joint shape-color descriptors are used) and late fusion (where histogram representations of color and shape are simply concatenated). Given separate vocabularies for shape and color, each training image's corresponding color histogram is estimated and a class-specific posterior p(class|word) is estimated. By concatenating the posteriors for all the color words of interest, the corresponding low-level features are primed. Difference-of-Gaussian detectors, Harris-Laplace detectors and SIFT descriptors are used to obtain the shape descriptors. The Color Name and HUE descriptors are used as color descriptors [368, 376, 377]. A standard χ2 SVM is used for classifying images. These top-down approaches are compared to early-fusion-based approaches that combine SIFT descriptors with color descriptors, and which are known to perform well [378]. It is shown that for certain types of images the top-down priming can result in drastic classification improvements.
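A minimal sketch of the kind of top-down modulation described above: class-specific posteriors over color words weight the contribution of each local feature to the shape-word histogram. The weighting scheme and the names are illustrative, not the exact formulation of [358].

```python
import numpy as np

def primed_bow_histogram(shape_words, color_words, p_class_given_color,
                         n_shape_words, target_class):
    """shape_words, color_words: (n,) word indices of the n local features.
    p_class_given_color: (n_color_words, n_classes) posteriors learned on training data.
    Returns a shape-word histogram in which each feature is weighted by how strongly
    its color word votes for the target class (top-down priming)."""
    weights = p_class_given_color[color_words, target_class]
    hist = np.zeros(n_shape_words)
    np.add.at(hist, shape_words, weights)        # weighted vote per shape word
    total = hist.sum()
    return hist / total if total > 0 else hist
```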

Within the context of Fig. 1, it is evident that during Pascal 2009 there was a significant shift towards more complex object representations and more complex object inference and verification algorithms. This is evidenced by the incorporation of top-down priming mechanisms, complex kernels that incorporate contextual knowledge, as well as by novel local sparse descriptors which achieved top-ranked performance. Consistent in all this work is the preference for using SVMs for contextual classification, model building during training, as well as object recognition, demonstrating that the use of SVMs has become more subtle and less monolithic compared to early recognition algorithms.

4.2.6. Pascal 2010

Perronnin et al. [354] present an extension of their earlier work [364], which we have already described in this section. The modifications they introduce result in an increase of over 10% in the average precision. An interesting aspect of this work is that during the 2010 challenge the system was also trained on its own (non-VOC) dataset and subsequently tested successfully on various tasks, demonstrating the algorithm's ability to generalize. The authors achieve these results by using linear classifiers. This last point is important since linear SVMs have a training cost of O(N), while non-linear SVMs have a training cost of around O(N^2) to O(N^3), where N is the number of training images. Thus, training non-linear SVMs becomes impractical with tens of thousands of training images. The authors achieve this improvement in their results by normalizing the respective gradient vectors first described in [364]. Another problem with the gradient representation is the sparsity of many vector dimensions. As a result, the authors apply to each dimension a function f(z) = sign(z)|z|^α for some α ∈ [0, 1], which results in a significant classification improvement.
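A minimal sketch of the per-dimension power normalization f(z) = sign(z)|z|^α described above, followed by the L2 normalization commonly applied to the resulting gradient (Fisher) vector; α = 0.5 is a typical choice but is an assumption here.

```python
import numpy as np

def power_normalize(fisher_vector, alpha=0.5):
    """Apply f(z) = sign(z) * |z|**alpha to every dimension, then L2-normalize.
    The power step reduces the dominance of sparse, near-zero dimensions."""
    v = np.sign(fisher_vector) * np.abs(fisher_vector) ** alpha
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```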

A number of other novel ideas were tested within the context of the 2010 Pascal challenge. van de Sande et al. [351] proposed a selective search algorithm for efficiently searching within a single image, without having to exhaustively search the entire image (see Secs. 2.11, 3.2 for more related work). They achieve this by adopting segmentation as a selective search strategy, so that rather than aiming for a few accurate object localizations, they generate more approximate object localizations, thus placing a higher emphasis on high recall rates.
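The following is a heavily simplified sketch of segmentation-driven candidate generation in the spirit of the selective search idea above: starting from an initial over-segmentation, the most similar regions are merged greedily and a candidate bounding box is emitted at every merge. The region descriptors, the similarity measure, and the all-pairs (rather than neighbour-only) merging are simplifying assumptions, not the authors' exact procedure.

```python
import numpy as np

def candidate_boxes(regions):
    """regions: list of dicts with keys 'box' = (x1, y1, x2, y2) and
    'hist' = L1-normalized color histogram of the region.
    Returns the boxes produced while greedily merging the most similar pair;
    these serve as approximate object location proposals."""
    regions = [dict(r) for r in regions]
    proposals = [r['box'] for r in regions]

    def similarity(a, b):                     # histogram intersection
        return np.minimum(a['hist'], b['hist']).sum()

    def merge(a, b):
        box = (min(a['box'][0], b['box'][0]), min(a['box'][1], b['box'][1]),
               max(a['box'][2], b['box'][2]), max(a['box'][3], b['box'][3]))
        hist = (a['hist'] + b['hist']) / 2.0  # crude merge of the descriptors
        return {'box': box, 'hist': hist}

    while len(regions) > 1:
        # Pick the most similar pair (a real system restricts this to neighbours).
        pairs = [(i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged['box'])
    return proposals
```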



Figure 52: (a) The 3-layer tree-like object representation in [348]. (b) A reference template without any part displacement, showing the root-node bounding box (blue), the centers of the 9 parts in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and (d) denote object localizations (from [348] with permission).

A novel object-class-specific part representation was also introduced for human pose estimation [352, 353]. It achieved state-of-the-art performance for localizing people, demonstrating the significance of properly choosing the object representations.

Overall, in the top-ranked systems of Pascal 2010 there is evidence of an effort to mitigate the effects of training set biases. This has motivated Perronnin et al. [354] to test the generalization ability of their system even when trained on a non-Pascal-related dataset. Approaches proposed to improve the computational complexity of training and online search algorithms include the use of combinations of linear and non-linear SVMs as well as various image search algorithms. Within the context of Fig. 1, this corresponds to ways of improving the hypothesis generation and object verification process.

4.2.7. Pascal 2011

Zhu et al. [348] present an incremental concave-convex procedure (iCCP) which enables the authors to efficiently learn both two- and three-layer object representations. The authors demonstrate that their algorithm outperforms the model by Felzenszwalb et al. [363]. These results are used by the authors as evidence that deep structures (3 layers) are better than 2-layer based object representations (see Fig. 52). The authors begin their exposition by describing the task of structural SVM learning. Let (x_1, y_1, h_1), ..., (x_N, y_N, h_N) ∈ X × Y × H denote training samples, where the x_i are training patches, the y_i are class labels, and h_i = (V_i, \vec{p}_i), with V_i denoting a viewpoint and \vec{p}_i denoting the positions of object parts. In other words, the h_i encode the spatial arrangement of the object representation. In structural SVM learning the task is to learn a function F_w(x) = arg max_{y,h} [w · Φ(x, y, h)], where Φ is a joint feature vector encoding the relation between the input x and the structure (y, h). In practice Φ encodes spatial and appearance information similarly to [363]. If the structure information h is not labelled in the training set (as is usually the case, since in training data we are customarily only given the bounding box of the object of interest and not part-relation information), then we deal with the latent structural SVM problem, where we need to solve

\min_w \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \Big[ \max_{y,h} \big[ w \cdot \Phi_{i,y,h} + L_{i,y,h} \big] - \max_{h} \big[ w \cdot \Phi_{i,y_i,h} \big] \Big],        (26)

where C is a constant penalty value, Φ_{i,y,h} = Φ(x_i, y, h), and L_{i,y,h} = L(y_i, y, h) is the loss function, which equals 1 if and only if y ≠ y_i. The authors use some previous results from the latent structural SVM training literature: by splitting the above expression into two terms, they iteratively find a hyperplane (a function of w) which bounds the last max term (which is concave in terms of w), replace the max term with this hyperplane, solve the resulting convex problem, and repeat the process. This trains the model and enables the authors to use F_w to localize objects in an image, achieving comparatively excellent results.
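The following is a minimal sketch of the alternation underlying concave-convex training of the kind described above, reduced to a binary latent SVM with a discrete latent variable (e.g., a part placement chosen from a small candidate set) and with the convex step approximated by a few epochs of sub-gradient descent on the hinge loss. All names and hyperparameters are illustrative; this is not the iCCP algorithm of [348].

```python
import numpy as np

def latent_svm_train(phi, labels, n_latent, dim, C=1.0, outer_iters=5,
                     epochs=20, lr=1e-3):
    """phi(i, h) -> (dim,) joint feature vector of sample i with latent choice h.
    labels: (N,) array with values in {-1, +1}.
    Alternates between (a) imputing the best latent value for the positives under
    the current w (linearizing the concave term) and (b) an approximate solution
    of the resulting convex problem, in the spirit of CCCP."""
    N = len(labels)
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # (a) Fix the latent variables of the positive examples.
        h_pos = {i: max(range(n_latent), key=lambda h: w @ phi(i, h))
                 for i in range(N) if labels[i] > 0}
        # (b) Approximately solve the convex problem by sub-gradient descent.
        for _ in range(epochs):
            for i in np.random.permutation(N):
                if labels[i] > 0:
                    f = phi(i, h_pos[i])                  # latent value held fixed
                else:
                    # Negatives: maximize over latent values (still convex in w).
                    h_best = max(range(n_latent), key=lambda h: w @ phi(i, h))
                    f = phi(i, h_best)
                margin = labels[i] * (w @ f)
                grad = w / (C * N) - (labels[i] * f if margin < 1 else 0.0)
                w -= lr * grad
    return w
```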



Figure 53: On using context to mitigate the negative effects of ambiguous localizations [350]. The greater the ambiguity, the greater the role contextual knowledge plays (from [350] with permission).

Chen et al. [349] present a similar latent hierarchical model, which is also solved using a concave-convex procedure and whose results are comparable to other state-of-the-art algorithms. The latent-SVM procedure is again used to learn the hierarchical object representation. A top-down dynamic programming algorithm is used to localize the objects.

Song et al. [350] present a paper on using context to improve image classification and object localization performance when we are dealing with ambiguous situations where methodologies that do not use context tend to fail (see Fig. 53). The authors report top-ranked results on the PASCAL VOC 2007 and 2010 datasets. In more detail, the authors present the Contextualized Support Vector Machine. In general, SVM-based classification assumes the use of a fixed hyperplane w_0^T · x^f + b = 0. Given an image X_i with a specific feature vector x_i^f and an image-specific contextual information vector x_i^c, the authors adapt the vector w_0 into a vector w_i = P x_i^c + w_0 that is based on x_i^c and a transformation matrix P. The matrix P = \sum_{r=1}^{R} u_r q_r^T is constrained to be a low-rank matrix with few parameters, and as a result w_i = w_0 + \sum_{r=1}^{R} (q_r^T x_i^c) u_r. Thus, the SVM margin of image X_i is γ_i = y_i (w_0^T x_i^f + \sum_{r=1}^{R} (q_r^T x_i^c)(u_r^T x_i^f) + b), where y_i ∈ {−1, 1} is a class label. The authors define each vector u_r so that for unambiguous features x_i^f, the scalar (u_r^T x_i^f) takes a small value close to 0, in which case γ_i ≈ y_i (w_0^T x_i^f + b). Thus, only for ambiguous images is contextual knowledge used.

The context vector length is equal to the number of objects we are searching for, and the vector is built using a straightforward search for the highest-confidence location of each of those objects in the image. The authors specify an iterative methodology for adapting these vectors w_i until satisfactory performance is achieved. The significant improvements that this methodology offers demonstrate how important context is in the object recognition problem.
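A minimal sketch of the contextualized scoring rule reconstructed above: the base score w_0^T x^f + b is augmented by rank-R bilinear context terms, so the context only changes the decision when the feature response u_r^T x^f is non-negligible (i.e., when the image is ambiguous). The variable names are ours, and learning the u_r and q_r is not shown.

```python
import numpy as np

def contextualized_score(x_f, x_c, w0, b, U, Q):
    """x_f: (d_f,) image feature vector; x_c: (d_c,) context vector.
    U: (R, d_f) matrix whose rows are the u_r; Q: (R, d_c) rows q_r.
    Returns the contextualized SVM score w_i^T x_f + b with
    w_i = w0 + sum_r (q_r^T x_c) u_r."""
    base = w0 @ x_f + b
    context_term = np.sum((Q @ x_c) * (U @ x_f))   # sum_r (q_r.x_c)(u_r.x_f)
    return base + context_term
```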

Overall, in one of the top-ranked approaches of Pascal 2011, Zhu et al. [348] demonstrated that even deeper hierarchies are achievable. They showed that such hierarchies can provide even better results than another top-ranked Pascal competition algorithm [363]. Within the context of Fig. 1, the work by Zhu et al. [348] provides an approach for building deeper hierarchies, which affect the grouping, hypothesis generation and verification modules of the standard recognition pipeline. Song et al. [350] provided an elegant way of adaptively controlling the object hypothesis module, by using context as an index that selects a different classifier appropriate for the current context.

4.3. The Evolving Landscape

In 1965, Gordon Moore stated that the number of transistors that could be incorporated per integrated circuit would increase exponentially with time [379, 380]. This provided one of the earliest technology roadmaps for semiconductors. Even earlier, Engelbart [381] made a similar prediction on the miniaturization of circuitry. Engelbart would later join SRI and found the Augmentation Research Center (ARC), which is widely credited as a pioneer in the creation of modern Internet-era computing, due to the center's early proposals for the mouse, videoconferencing, interactive text editing, hypertext and networking [382]. As Engelbart would later point out, it was his early prediction of the rapid increase in computational power that convinced him of the promise of the research topics later pursued by his ARC laboratory. The early identification of trends and shifts in technology can provide a competitive edge for any individual or corporation. The question arises as to whether we are currently entering a technological shift of the same scope and importance as the one identified by Moore and Engelbart fifty years ago.



For all intents and purposes, Moore's law is coming to an end. While Moore's law is still technically valid, since multicore technologies have enabled circuit designers to inexpensively pack more transistors on a single chip, this no longer leads to commensurate increases in application performance. Moore's law has historically provided a vital technology roadmap that influenced the agendas of diverse groups in academia and business. Today, fifty years after the early research on object recognition systems, we are simultaneously confronted with the end of Moore's law and with a gargantuan explosion in multimedia data growth [253]. Fundamental limits on processing speed, power consumption, reliability and programmability are placing severe constraints on the evolution of the computing technologies that have driven economic growth since the 1950s [383]. It is becoming clear that traditional von Neumann architectures are becoming unsuitable for human-level intelligence tasks, such as vision, since the machine complexity in terms of the number of gates and their power requirements tends to grow exponentially with the size of the input and the environment complexity [384, 383]. The question for the near future is that of determining to what extent the end of Moore's law will lead to a significant evolution in vision research that will be capable of accommodating the shifting needs of industry. As the wider vision community slowly begins to address this fact, it will define the evolution of object recognition research, it will influence the vision systems that remain relevant, and it will lead to significant changes in vision and computer science education in general by affecting other related research areas that are strongly dependent on vision (such as robotics).

According to the experts responsible for the International Technology Roadmap for Semiconductors [384], the most promising future strategy for chip and system design is that of complementing current information technology with low-power computing systems inspired by the architecture of the brain [383]. How would von Neumann architectures compare to a non-von Neumann architecture that emulates the organization of the organic brain? The two architectures should be suitable for complementary applications. The complexity of neuromorphic architectures should increase more gradually with increasing environment complexity, and such architectures should tolerate noise and errors [383]. However, neuromorphic architectures would likely not be suitable for high-precision numerical analysis tasks. Modern von Neumann computing requires an explicit program and relies on synchronous, serial, centralized, hardwired, general-purpose and brittle circuits [385]. The brain architecture, on the other hand, relies on neurons and synapses operating in a mixed digital-analog mode, and is asynchronous, parallel, fault tolerant, distributed, slow, and with a blurred distinction between CPU and memory (as compared to von Neumann architectures), since memory is, to a large extent, represented by the synaptic weights.

How does our current understanding of the human brain differentiate it from typical von Neumann architectures? Turing made the argument that since brains are computers, brains are computable [386]. But if that is indeed the case, why do reliable image understanding algorithms still elude us? Churchland [387] and Hawkins [388] argue that general-purpose AI is difficult because (i) computers must have a large knowledge base, which is difficult to construct, and because (ii) it is difficult to extract the most relevant and contextual information from such a knowledge base. As was demonstrated throughout our discussion of object recognition systems, the problem of efficient object representations and efficient feature extraction constitutes a central tenet of any non-trivial recognition system, which supports the viewpoint of Churchland and Hawkins.

There is currently a significant research thrust towards the construction of neuromorphic systems, both at the hardware and the software level. This is evidenced by recent high-profile projects, such as EU funding of the Human Brain Project with over a billion euros over 10 years [383], U.S. funding for the NIH BRAIN Initiative [389], and by the growing interest in academia and industry for related projects [390, 391, 392, 393, 394, 395, 396, 397]. The appeal of neuromorphic architectures lies in [398] (i) the possibility of such architectures achieving human-like intelligence by utilizing unreliable devices that are similar to those found in neuronal tissue, (ii) the ability of neuromorphic strategies to deal with anomalies, caused by noise and hardware faults for example, and (iii) their low power requirements, due to their lack of a power-intensive bus and due to the blurring of the distinction between CPU and memory.

Vision and object recognition should assume a central role in any such research endeavour. About 40% of the neocortex is devoted to visual areas V1 and V2 [388], which in turn are devoted just to low-level feature extraction. It is thus reasonable to argue that solving the general AI problem is similar in scope to solving the image understanding problem (see Sec. 1). Current hardware and software architectures for vision systems are unable to scale to the massive computational resources required for this task. The elegance of the solution to the vision problem is astounding. The neocortex makes up about 80% of the human brain, which has around 100 billion neurons and 10^14 synapses, consumes just 20-30 Watts, and is to a large extent self-trained [399]. One of the most astounding results in neuroscience is attributable to Mountcastle [400, 388, 401]. By investigating the detailed anatomy of the neocortex, he was able to show that the micro-architecture of its regions looks extremely similar regardless of whether a region is used for vision, hearing or language.



Mountcastle proposed that all parts of the neocortex operate based on a common principle, with the cortical column being the unit of computation. What distinguishes different regions is simply their input (whether the input is vision based, auditory based, etc.). From a machine learning perspective this is a surprising and puzzling result, since the no-free-lunch theorem, according to which it is best to use a problem-specific optimization/learning algorithm, permeates much of machine learning research. In contrast, the neocortex seems to rely on a single learning architecture for all its tasks and input modalities. Looking back at the object recognition algorithms surveyed in this paper, it becomes clear that no mainstream vision system comes close to achieving the generalization abilities of the neocortex. This sets the stage for what may well become one of the most challenging and rewarding scientific endeavours of this century.

5. Conclusion

We have presented a critical overview of the object recognition literature, pointed out some of the major challenges facing the community and emphasized some of the characteristic approaches attempted over the years for solving the recognition problem. We began the survey by discussing how the needs of industry led to some of the earliest industrial inspection and character recognition systems. It is pleasantly surprising to note that despite severe limitations in CPU speeds and sensor quality, such early systems were astoundingly accurate, thus contributing to the creation of the field of computer vision, with object recognition assuming a central role. We pointed out that recognition systems perform well in controlled environments but have been unable to generalize in less controlled environments. Throughout the survey we have discussed various proposals set forth by the community on possible causes of and solutions to this problem. We continued by surveying some of the characteristic classical approaches for solving the problem. We then discussed how this led to the realization that more control over the data acquisition process is needed. This realization contributed to the popularity of active and attentive systems. We noted how this led to a stronger confluence between the vision and robotics communities and surveyed some relevant systems. We continued the survey by discussing some common testing strategies and fallacies that are associated with recognition systems. We concluded by discussing in some depth some of the most successful recognition systems that have been openly tested in various object recognition challenges. As we alluded to in the previous section, tantalizing evidence from neuroscience indicates that there is an elegant solution to the vision problem that should also be capable of spanning the full AI problem (e.g., voice recognition, reasoning, etc.), thus providing the necessary motivation for a radical rethinking of the strategies used by the community in tackling the problem.

In Tables 1-7 we compared some of the more distinct recognition algorithms along a number of dimensions which characterize each algorithm's ability to bridge the so-called semantic gap: the inability of less complex but easily extractable indexing primitives to be grouped/organized so that they provide more high-level and more powerful indexing primitives. This dilemma has directly or indirectly influenced much of the literature (also see Fig. 1). It is exemplified, for example, by CBIR systems, which rely on low-level indexing primitives for efficiency reasons. It is also exemplified by the fact that no recognition system has consistently demonstrated graceful degradation as the scene complexity increases, as the number of object classes increases, and as the complexity of each object class increases [26]. While there is significant success in building robust exemplar recognition systems, the success in building generic recognition systems is questionable. Furthermore, from Tables 1-7 we notice that few papers have attempted to address to a large extent all the dimensions of robust recognition systems. For example, in more recent systems, the role of 3D parts-based representations has significantly diminished. Within this context, active recognition systems were proposed as an aid in bridging the semantic gap, by adding a greater level of intelligence to the data acquisition process. However, in practice, and as exemplified by Tables 5 and 6, very few such systems currently address the full spectrum of recognition sub-problems. As the role of the desktop computer diminishes and the role of mobile computing becomes more important, a commensurate increase in the importance of power-efficient systems emerges. A power-efficient solution to the recognition problem precipitates significant advancements to all of the above-mentioned problems.

Within the context of the vision problem, recognition constitutes the most difficult but also the most rewarding problem, since most vision problems can be reformulated in terms of the recognition problem (albeit perhaps not as efficiently). Some general comments are in order.



The recognition and vision problem is highly interdisciplinary, spanning the fields of machine learning and decision making under uncertainty, robotics, signal processing, mathematics, statistics, psychology, neuroscience, HCI, databases, supercomputing and visualization/graphics. The highly interdisciplinary nature of the problem is both an advantage and a disadvantage. It is an advantage due to the vast research opportunities it gives to the experienced vision practitioner. It is a disadvantage because the diversity of the field makes it all the more pertinent that the practitioner is careful and sufficiently experienced in identifying research that can advance the field.

Based on the above survey, we reach a number of conclusions: (i) The solution to the recognition problem will require significant advances in the representation of objects, the inference and learning algorithms used, as well as the hardware platforms used to execute such systems. In general, artificial recognition systems are still far removed from the elegance and generalization capabilities that solutions based on the organic brain are endowed with. (ii) The issue of bridging the semantic gap between low-level image features and high-level object representations keeps re-emerging in the literature. Such low-level indexing primitives are easy to extract from images but are often not very powerful indexing primitives (see Fig. 1). In contrast, high-level object representations are significantly more powerful indexing primitives, but efficiently learning object models based on such primitives, and extracting such primitives from images, remains a difficult problem. The dilemma of indexing strength vs. system efficiency permeates the recognition literature and plays a decisive role in the design of commercial systems, such as Content Based Image Retrieval systems. (iii) A parts-based hierarchical modelling of objects will almost certainly play a central role in the problem's solution and the bridging of the semantic gap. While such models have shown some success in distinguishing between a small number of classes, they generally fail as the scene complexity increases, as the number of object classes increases and as the similarity between the object classes increases. (iv) For each neuron in the neocortex there correspond on average 10,000 synapses, thus demonstrating that there is a significant gap in terms of the input size and the computational resources needed to reliably process the input. Active and attentive approaches can help vision systems cope with many of the intractable aspects of passive approaches to the vision problem by reducing the complexity of the input space. An active approach to vision can help solve real-world problems such as degeneracies, occlusions, varying illumination and extreme variations in object scale.

A great deal of the research on passive recognition has focused, to some extent, on the feature selection stage of the recognition problem without taking into consideration the effects of the various cost constraints discussed in this paper. Virtually all the research on active object recognition has only attempted to optimize a small number of extrinsic camera parameters while assuming that the recognition algorithm is a rather static black box. More work on investigating the confluence of the two sets of parameters could potentially lead to more efficient search strategies. Finally, the survey has supported the view that the computational complexity of vision algorithms must constitute a central guiding principle during the construction of such systems.

Acknowledgements

We thank the reviewers for their insightful comments and suggestions that helped us improve the paper. A.A. first submitted this paper while he was affiliated with York University.

References

[1] R. Graves, The Greek Myths: Complete Edition, Penguin, 1993.[2] L. G. Roberts, Pattern Recognition With An Adaptive Network, in: Proc. IRE International Convention Record, 66–70, 1960.[3] J. T. Tippett, D. A. Borkowitz, L. C. Clapp, C. J. Koester, A. J. Vanderburgh (Eds.), Optical and Electro-Optical Information Processing,

MIT Press, 1965.[4] L. G. Roberts, Machine Perception of Three Dimensional Solids, Ph.D. thesis, Massachusetts Institute of Technology, 1963.[5] M. Ejiri, Machine Vision in Early Days: Japan’s Pioneering Contributions, in: Proc. 8th Asian Conference on Computer Vision (ACCV),

2007.[6] S. Kashioka, M. Ejiri, Y. Sakamoto, A transistor wire-bonding system utilizing multiple local pattern matching techniques, IEEE Trans.

Syst. Man Cybern. 6 (8) (1976) 562–570.[7] G. Gallus, Contour analysis in pattern recognition for human chromosome classification, Appl Biomed Calcolo Electronico 2.[8] G. Gallus, G. Regoliosi, A Decisional Model of Recognition Applied to the Chromosome Boundaries, J Histochem Cytochem 22.[9] A. Jimenez, R. Ceres, J. Pons, A survey of computer vision methods for locating fruits on trees, IEEE Transactions of the ASABE 43 (6)

(2000) 1911–1920.[10] E. N. Malamas, E. G. M. Petrakis, M. Zervakis, L. Petit, J-D.Legat, A survey on industrial vision systems, applications and tools, Image and

Vision Computing 21 (2) (2003) 171–188.



[11] T. McInerney, D. Terzopoulos, Deformable Models in Medical Image Analysis: A Survey, Medical Image Analysis 1 (2) (1996) 91–108.[12] A. Andreopoulos, J. K. Tsotsos, Efficient and Generalizable Statistical Models of Shape and Appearance for Analysis of Cardiac MRI,

Medical Image Analysis 12 (3) (2008) 335–357.[13] O. D. Trier, A. K. Jain, T. Taxt, Feature extraction methods for character recognition-A survey, Pattern Recognition 29 (4) (1996) 641–662.[14] S. Mori, H. Nishida, H. Yamada, Optical Character Recognition, John Wiley and Sons, 1999.[15] K. Takahashi, T. Kitamura, M. Takatoo, Y. Kobayashi, Y. Satoh, Traffic flow measuring system by image processing, in: Proc. IAPR MVA,

245–248, 1996.[16] C.-N. Anagnostopoulos, I. Anagnostopoulos, I. Psoroulas, V. Loumos, E. Kayafas, License Plate Recognition From Still Images and Video

Sequences: A Survey, IEEE Transactions on Intelligent Transportation Systems 9 (3) (2008) 377–391.[17] D. Maltoni, D. Maio, A. K. Jain, S. Prabhakar, Handbook of Fingerprint Recognition, Springer Publishing Company, 2nd edn., 2009.[18] K. W. Bowyer, K. Hollingsworth, P. J. Flynn, Image understanding for iris biometrics: A survey, Computer Vision and Image Understanding

110 (2) (2008) 281–307.[19] C.-L. Lin, K.-C. Fan, Biometric verification using thermal images of palm-dorsa vein patterns, IEEE Transactions on Circuits and Systems

for Video Technology 14 (2) (2004) 199–213.[20] N. Miura, A. Nagasaka, Extraction of Finger-Vein Patterns Using Maximum Curvature Points in Image Profiles, in: IAPR Conference on

Machine Vision Applications, 2005.[21] J. Tsotsos, The Encyclopedia of Artificial Intelligence, chap. Image Understanding, John Wiley and Sons, 641–663, 1992.[22] S. Dickinson, What is Cognitive Science?, chap. Object Representation and Recognition, Basil Blackwell publishers, 172–207, 1999.[23] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, W. H. Freeman and

Company, 1982.[24] A. Andreopoulos, S. Hasler, H. Wersing, H. Janssen, J. K. Tsotsos, E. Korner, Active 3D Object Localization using a humanoid robot, IEEE

Transactions on Robotics 27 (1) (2011) 47–64.[25] P. Perona, Object Categorization: Computer and Human Perspectives, chap. Visual Recognition circa 2008, Cambridge University Press,

55–68, 2009.[26] A. Andreopoulos, J. K. Tsotsos, A Computational Learning Theory of Active Object Recognition Under Uncertainty, Int. J. Comput. Vision

101 (1) (2013) 95–142.[27] S. Edelman, Object Categorization: Computer and Human Vision Perspectives, chap. On what it means to see, and what we can do about it,

Cambridge University Press, 3–24, 2009.[28] J. K. Tsotsos, On the Relative Complexity of Active vs. Passive Visual Search, Int. J. Comput. Vision 7 (2) (1992) 127–141.[29] J. K. Tsotsos, A Computational Perspective on Visual Attention, MIT Press, 2011.[30] Aristotle, Περι Ψυχης (On the Soul), 350 B.C.[31] R. Bajcsy, Active Perception, Proceedings of the IEEE 76 (8) (1988) 966–1005.[32] J. Aloimonos, A. Bandopadhay, I. Weiss, Active Vision, Int. J. Comput. Vision 1 (1988) 333–356.[33] J. M. Findlay, I. D. Gilchrist, Active Vision: The psychology of looking and seeing, Oxford University Press, 2003.[34] F. Brentano, Psychologie vom Empirischen Standpunkt, Meiner, Leipzig .[35] H. Barrow, R. Popplestone, Relational Descriptions in Picture Processing, Machine Intelligence 6 (1971) 377–396.[36] T. Garvey, Perceptual strategies for purposive vision, Tech. Rep., Technical Note 117, SRI Int’l., 1976.[37] J. Gibson, The ecological approach to visual perception, Houghton Mifflin, Boston, 1979.[38] R. Nevatia, T. Binford, Description and Recognition of Curved Objects, Artificial Intelligence 8 (1977) 77–98.[39] R. Brooks, R. Greiner, T. Binford, The ACRONYM Model-Based Vision System, in: Proc. of 6th Int. Joint Conf. on Artificial Intelligence,

1979.[40] I. Biederman, Recognition-by-Components: A Theory of Human Image Understanding, Psychological Review 94 (1987) 115–147.[41] K. Ikeuchi, T. Kanade, Automatic Generation of Object Recognition Programs, in: IEEE, vol. 76, 1016–1035, 1988.[42] R. Bajcsy, Active Perception vs. Passive Perception, in: IEEE Workshop on Computer Vision Representation and Control, Bellaire, Michi-

gan, 1985.[43] D. Ballard, Animate Vision, Artificial Intelligence 48 (1991) 57–86.[44] S. Soatto, Actionable Information in Vision, in: Proc. IEEE Int. Conf. on Computer Vision, 2009.[45] A. Andreopoulos, J. K. Tsotsos, A Theory of Active Object Localization, in: Proc. IEEE Int. Conf. on Computer Vision, 2009.[46] L. Valiant, Deductive Learning, Philosophical Transactions of the Royal Society of London 312 (1984) 441–446.[47] L. Valiant, A theory of the learnable, Communications of the ACM 27 (11) (1984) 1134–1142.[48] L. Valiant, Learning disjunctions of conjunctions, in: Proc. 9th International Joint Conference on Artificial Intelligence, 1985.[49] S. Dickinson, D. Wilkes, J. Tsotsos, A Computational Model of View Degeneracy, IEEE Trans. Patt. Anal. Mach. Intell. 21 (8) (1999)

673–689.[50] E. Dickmanns, Dynamic Vision for Perception and Control of Motion, Springer-Verlag, London, 2007.[51] S. J. Dickinson, A. Leonardis, B. Schiele, M. J. Tarr (Eds.), Object Categorization: Computer and Human Vision Perspectives, Cambridge

University Press, 2009.[52] A. Pinz, Object Categorization, Foundations and Trends in Computer Graphics and Vision 1 (4).[53] A. R. Hanson, E. M. Riseman, Computer Vision Systems, Academic Press, 1977.[54] T. Binford, Visual Perception by Computer, in: IEEE Conference on Systems and Control, Miami, FL, 1971.[55] D. Marr, H. Nishihara, Representation and Recognition of the spatial organization of three dimensional shapes, in: Proceedings of the Royal

Society of London B, vol. 200, 269–294, 1978.[56] R. Brooks, Symbolic Reasoning Among 3-D Models and 2-D Images, Artificial Intelligence Journal 17 (1-3) (1981) 285–348.[57] I. Biederman, M. Bar, One-shot viewpoint invariance in matching novel objects, Vision Research 39 (1999) 2885–2899.[58] W. G. Hayward, M. J. Tarr, Differing views on views: comments on Biederman and Bar (1999), Vision Research 40 (2000) 3895–3899.[59] I. Biederman, M. Bar, Differing views on views: response to Hayward and Tarr (2000), Vision Research 40 (2000) 3901–3905.



[60] M. Tarr, Q. Vuong, Steven’s Handbook of Experimental Psychology (3rd ed.), Vol. 1: Sensation and Perception, chap. Visual object recog-nition, John Wiley & Sons, 287–314, 2002.

[61] M. Zerroug, R. Nevatia, Three-dimensional descriptions based on the analysis of the invariant and quasi-invariant properties of some curved-axis generalized cylinders, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (3) (1996) 237–253.

[62] R. Bolles, R. Horaud, 3DPO: A three-dimensional part orientation system, International Journal of Robotics Research 5 (3) (1986) 26.[63] C. Goad, From Pixels to Predicates, chap. Special Purpose Automatic Programming for 3D model-based vision, Ablex Publishing, 371–391,

1986.[64] D. G. Lowe, Three-dimensional object recognition from single two-dimensional images, Artificial Intelligence 31 (3) (1987) 355–395.[65] D. Huttenlocher, S. Ullman, Recognizing Solid Objects by Alignment with an Image, International Journal of Computer Vision 5 (2) (1990)

195–212.[66] S. Sarkar, K. Boyer, Integration, Inference, and Management of Spatial Information Using Bayesian Networks: Perceptual Organization,

IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (3) (1993) 256–274.[67] W. Grimson, T. Lozano-Perez, Model based recognition and localization from sparse range or tactile data, The International Journal of

Robotics Research 3 (3) (1984) 3–35.[68] T. Fan, G. Medioni, R. Nevatia, Recognizing 3-D Objects using surface descriptors, IEEE Transactions on Pattern Analysis and Machine

Intelligence 11 (11) (1989) 1140–1157.[69] D. Clemens, Region-based feature interpretation for recognizing 3D models in 2D images, Tech. Rep. 1307, MIT AI Laboratory, 1991.[70] H. Blum, A transformation for extracting new descriptors of shape, in: Models for the perception of speech and visual form, MIT press,

1967.[71] J. Koenderink, A. van Doorn, Internal Representation of Solid Shape with Respect to Vision, Biological Cybernetics 32 (4) (1979) 211–216.[72] D. Lowe, Object Recognition from Local Scale-Invariant Features, in: Proc. ICCV, 1999.[73] A. Andreopoulos, J. K. Tsotsos, On Sensor Bias in Experimental Methods for Comparing Interest Point Saliency and Recognition Algo-

rithms, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1) (2012) 110–126.[74] M. J. Kearns, U. V. Vazirani, An Introduction to Computational Learning Theory, MIT Press, 1994.[75] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proc. IEEE Conference on Computer Vision and Pattern Recogni-

tion, 2006.[76] M. Wertheimer, Untersuchungen zur Lehre von der Gestalt II, Psychologische Forschung 4 (1923) 301–350.[77] W. Kohler, Gestalt Psychology, New York: Liveright, 1929.[78] K. Koffka, Principles of Gestalt Psychology, New York: Harcourt, Brace, 1935.[79] S. Palmer, Vision Science. Photons to Phenomenology, MIT Press, 1999.[80] D. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Prentice Hall, 2003.[81] J. Elder, S. Zucker, A Measure of Closure, Vision Res. 34 (1994) 3361–3369.[82] A. Berengolts, M. Lindenbaum, On the Distribution of Saliency, in: Computer Vision and Pattern Recognition, 2004.[83] A. Berengolts, M. Lindenbaum, On the Distribution of Saliency, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (12)

(2006) 1973–1990.[84] D. Lowe, The Viewpoint Consistency Constraint, International Journal of Computer Vision 1 (1) (1987) 57–72.[85] J. Canny, A Computational Approach To Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence 8 (1986) 679–714.[86] S. X. Yu, J. Shi, Segmentation with Pairwise Attraction and Repulsion, in: International Conference on Computer Vision, 2001.[87] S. X. Yu, J. Shi, Understanding Popout through Repulsion, in: Computer Vision and Pattern Recognition, 2001.[88] P. Verghese, D. Pelli, The Information Capacity of Visual Attention, Vision Research 32 (5) (1992) 983–995.[89] Y. Lamdan, J. Schwartz, H. Wolfson, Affine invariant model-based object recognition, IEEE Transactions on Robotics and Automation 6 (5)

(1990) 578–589.[90] J. Schwartz, M. Sharir, Identification of Partially Obscured Objects in Two and Three Dimensions by Matching Noisy Characteristic Curves,

International Journal of Robotics Research 6 (2) (1986) 29–44.[91] A. Kalvin, E. Schonberg, J. Schwartz, M. Sharir, Two-dimensional Model-Based Boundary Matching Using Footprints, International Journal

of Robotics Research 5 (4) (1986) 38–55.[92] D. Forsyth, J. Mundy, A. Zisserman, C. Coelho, A. Heller, C. Rothwell, Invariant Descriptors for 3-D Object Recognition and Pose, IEEE

Transactions on Pattern Analysis and Machine Intelligence 13 (10) (1991) 971–991.[93] P. Flynn, A. Jain, 3D object recognition using invariant feature indexing of interpretation tables, CVGIP 55 (2) (1992) 119–129.[94] I. Rigoutsos, R. Hummel, A Bayesian Approach to Model Matching with Geometric Hashing, Computer Vision and Image Understanding

61 (7) (1995) 11–26.[95] H. Wolfson, I. Rigoutsos, Geometric hashing: an overview, IEEE Comput. Sc. and Eng. 4 (4) (1997) 10–21.[96] A. Wallace, N. Borkakoti, J. Thornton, TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural

databases. Application to enzyme active sites, Protein Sci. 6 (11) (1997) 2308–2323.[97] U. Grenander, General Pattern Theory, Oxford University Press, 1993.[98] S. Agarwal, D. Roth, Learning a sparse representation for object detection, in: European Conference on Computer Vision, vol. 4, 2002.[99] M. Weber, M. Welling, P. Perona, Towards automatic discovery of object categories, in: IEEE Conference on Computer Vision and Pattern

Recognition, 2000.[100] R. Fergus, P. Perona, A. Zisserman, Object Class Recognition by Unsupervised Scale-Invariant Learning, in: Computer Vision and Pattern

Recognition, 2003.[101] S. Lazebnik, C. Schmid, J. Ponce, Semi-local affine parts for object recognition, in: British Machine Vision Conference, 2004.[102] K. Mikolajczyk, C. Schmid, Scale and Affine invariant interest point detectors, Int. J. Comput. Vision 60 (1) (2004) 63–86.[103] D. G. Pelli, B. Farell, D. C. Moore, The Remarkable Inefficiency of Word Recognition, Nature 423 (2003) 752–756.[104] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition in cortex, Nature neuroscience 2 (11) (1999) 1019–1025.[105] T. Serre, L. Wolf, T. Poggio, Object Recognition with Features Inspired by Visual Cortex, in: Computer Vision and Pattern Recognition,



2005.[106] J. Mutch, D. G. Lowe, Multiclass Object Recognition with Sparse Localized Features, in: Computer Vision and Pattern Recognition, 2006.[107] M. A. Fischler, R. A. Elschlager, The representation and matching of pictorial structures, IEEE Transactions on Computers C-22 (1) (1973)

67–92.[108] K. Tanaka, Neuronal mechanisms of object recognition, Science (1993) 685–688.[109] A. Pentland, Perceptual Organization and the representation of natural form, Artificial Intelligence 28 (2) (1986) 293–331.[110] A. Jaklic, A. Leonardis, F. Solina, Segmentation and Recovery of Superquadrics, Springer, 2000.[111] T. Heimann, H.-P. Meinzer, Statistical shape models for 3D medical image segmentation: A review, Medical Image Analysis 13 (4) (2009)

543–563.[112] S. Dickinson, D. Metaxas, Integrating Qualitative and Quantitative Shape Recovery, International Journal of Computer Vision 13 (3) (1994)

1–20.[113] S. Sclaroff, A. Pentland, Modal Matching for Correspondence and Recognition, IEEE Transactions on Pattern Analysis and Machine Intel-

ligence 17 (6) (1995) 545–561.[114] T. Cootes, G. Edwards, C. Taylor, Active Appearance Models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6)

(2001) 681–685.[115] T. Cootes, C. Taylor, D. Cooper, J. Graham, Active Shape Models-Their training and application, Computer Vision and Image Understanding

61 (1) (1995) 38–59.
[116] T. Strat, M. Fischler, Context-based vision: recognizing objects using information from both 2D and 3D imagery, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (10) (1991) 1050–1065.
[117] L. Stark, K. Bowyer, Function-Based Generic Recognition for Multiple Object Categories, CVGIP 59 (1) (1994) 1–21.
[118] A. Torralba, P. Sinha, Statistical context priming for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 763–770, 2001.
[119] A. Torralba, K. Murphy, W. Freeman, M. Rubin, Context-based vision system for place and object recognition, in: ICCV, 2003.
[120] A. Torralba, Contextual Priming for Object Detection, International Journal of Computer Vision 53 (2) (2003) 169–191.
[121] D. Hoiem, A. A. Efros, M. Hebert, Putting Objects in Perspective, in: Computer Vision and Pattern Recognition, 2006.
[122] C. Siagian, L. Itti, Gist: A Mobile Robotics Application of Context-Based Vision in Outdoor Environment, in: Computer Vision and Pattern Recognition Workshops, 2005.
[123] L. Wolf, S. Bileschi, A Critical View of Context, International Journal of Computer Vision 69 (2) (2006) 251–261.
[124] M. Minsky, A Framework for Representing Knowledge, Tech. Rep. 306, MIT-AI Laboratory Memo, 1974.
[125] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
[126] A. Hanson, E. Riseman, The VISIONS Image-Understanding System, chap. 1, Lawrence Erlbaum Associates, 1–114, 1988.
[127] H. Grabner, J. Gall, L. V. Gool, What Makes a Chair a Chair?, in: Proc. CVPR, 2011.
[128] M. Stark, P. Lies, M. Zillich, J. Wyatt, B. Schiele, Functional object class detection based on learned affordance cues, in: Proc. of the 6th International Conference on Computer Vision Systems, 2008.
[129] J. Gibson, The Theory of Affordances, Erlbaum Associates, 1977.
[130] C. Castellini, T. Tommasi, N. Noceti, F. Odone, B. Caputo, Using Object Affordances to Improve Object Recognition, IEEE Transactions on Autonomous Mental Development 3 (3) (2011) 207–215.
[131] B. Ridge, D. Skocaj, A. Leonardis, Unsupervised Learning of Basic Object Affordances from Object Properties, in: Proc. Computer Vision Winter Workshop, 2009.
[132] A. Saxena, J. Driemeyer, A. Ng, Robotic Grasping of Novel Objects using Vision, The International Journal of Robotics Research 27 (2) (2008) 157–173.
[133] E. M. Riseman, A. R. Hanson, Computer Vision Research at the University of Massachusetts, International Journal of Computer Vision 2 (1989) 199–207.
[134] S. Z. Li, Markov Random Field Modeling in Image Analysis, Springer-Verlag, 2001.
[135] S. Kumar, M. Hebert, Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification, in: International Conference on Computer Vision, 2003.
[136] K. P. Murphy, A. Torralba, W. T. Freeman, Using the forest to see the trees: a graphical model relating features, objects and scenes, in: NIPS, 2003.
[137] L. Li, L. F. Fei, What, where and who? classifying events by scene and object recognition, in: ICCV, 2007.
[138] J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image categorization and segmentation, in: CVPR, 2008.
[139] G. Heitz, D. Koller, Learning spatial context: Using stuff to find things, in: ECCV, 2008.
[140] S. Divvala, D. Hoiem, J. Hays, A. Efros, M. Hebert, An empirical study of context in object detection, in: CVPR, 2009.
[141] H. Murase, S. Nayar, Visual Learning and Recognition of 3-D Objects From Appearance, IJCV 14 (1995) 5–24.
[142] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, International Journal of Computer Vision 73 (2) (2007) 213–238.
[143] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, The QBIC project: Querying images by content using color, texture and shape, in: SPIE Conference on Geometric Methods in Computer Vision II, 1993.
[144] M. Pontil, A. Verri, Support vector machines for 3D object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (6) (1998) 637–646.
[145] B. Schiele, J. Crowley, Recognition without correspondence using multidimensional receptive field histograms, International Journal of Computer Vision 36 (1) (2000) 31–50.
[146] M. Turk, A. Pentland, Face Recognition using Eigenfaces, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1991.
[147] C. Huang, O. Camps, T. Kanungo, Object Recognition Using Appearance-Based Parts and Relations, in: Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, 877–883, 1997.
[148] A. Leonardis, H. Bischof, Robust Recognition Using Eigenimages, Computer Vision and Image Understanding 78 (1) (2000) 99–118.

[149] S. Zhou, R. Chellappa, B. Moghaddam, Adaptive Visual Tracking and Recognition using Particle Filters, in: International Conference on Multimedia and Expo, 2003.
[150] R. P. N. Rao, D. H. Ballard, An Active Vision Architecture based on Iconic Representations, Artificial Intelligence 78 (1) (1995) 461–505.
[151] C. Schmid, R. Mohr, Local Grayvalue Invariants for Image Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5) (1997) 530–535.
[152] G. Carneiro, A. Jepson, Phase-Based Local Features, in: ECCV, 2002.
[153] F. Rothganger, S. Lazebnik, C. Schmid, J. Ponce, Object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints, International Journal of Computer Vision.
[154] R. C. Nelson, A. Selinger, A Cubist Approach to Object Recognition, in: Proc. International Conference on Computer Vision, Bombay, India, 614–621, 1998.
[155] S. Belongie, J. Malik, J. Puzicha, Shape Matching and Object Recognition Using Shape Contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4) (2002) 509–522.
[156] R. K. McConnell, Method of and apparatus for pattern recognition (U.S. Patent No. 4,567,610), 1986.
[157] W. Freeman, M. Roth, Orientation histograms for hand gesture recognition, in: Proc. IEEE Intl. Workshop on Automatic Face and Gesture Recognition, 296–301, 1995.
[158] K. Mikolajczyk, C. Schmid, An Affine Invariant Interest Point Detector, in: European Conference on Computer Vision, 2002.
[159] P. Torr, A. W. Fitzgibbon, A. Zisserman, Maintaining Multiple Motion Model Hypotheses Over Many Views to Recover Matching and Structure, in: International Conference on Computer Vision, 1998.
[160] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, L. V. Gool, Towards Multi-View Object Class Detection, in: Computer Vision and Pattern Recognition, 2006.
[161] V. Ferrari, T. Tuytelaars, L. V. Gool, Integrating Multiple Model Views for Object Recognition, in: Computer Vision and Pattern Recognition, 2004.
[162] V. Ferrari, T. Tuytelaars, L. V. Gool, Wide-Baseline Multiple-View Correspondences, in: Computer Vision and Pattern Recognition, 2003.
[163] T. Tuytelaars, L. V. Gool, Wide Baseline Stereo Matching based on Local Affinely Invariant Regions, in: British Machine Vision Conference, 2000.
[164] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust Wide Baseline Stereo from Maximally Stable Extremal Regions, in: British Machine Vision Conference, 2002.
[165] S. Se, D. Lowe, J. Little, Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks, The International Journal of Robotics Research 21 (8) (2002) 735–758.
[166] F. Li, J. Kosecka, Probabilistic Location Recognition Using Reduced Feature Set, in: IEEE International Conference on Robotics and Automation, 2006.
[167] W. Zhang, J. Kosecka, Image Based Localization in Urban Environments, in: International Symposium on 3D Data Processing, Visualization and Transmission, 2006.
[168] A. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-Based Image Retrieval at the End of the Early Years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1349–1380.
[169] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: ECCV International Workshop on Statistical Learning in Computer Vision, 2004.
[170] J. Sivic, A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, in: International Conference on Computer Vision, 2003.
[171] K. Grauman, T. Darrell, Unsupervised Learning of Categories from Sets of Partially Matching Image Features, in: Computer Vision and Pattern Recognition, 2006.
[172] I. Kokkinos, A. Yuille, Scale Invariance without Scale Selection, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
[173] C. Lampert, M. Blaschko, T. Hofmann, Efficient Subwindow Search: A Branch and Bound Framework for Object Localization, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 2129–2142.
[174] R. Fergus, P. Perona, A. Zisserman, A Sparse Object Category Model for Efficient Learning and Exhaustive Recognition, in: Computer Vision and Pattern Recognition, 2005.
[175] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, W. T. Freeman, Discovering objects and their location in images, in: International Conference on Computer Vision, 2005.
[176] S. Ullman, M. Vidal-Naquet, E. Sali, Visual Features of Intermediate Complexity and their use in classification, Nature Neuroscience 5 (7) (2002) 682–687.
[177] P. F. Felzenszwalb, D. P. Huttenlocher, Pictorial Structures for Object Recognition, International Journal of Computer Vision 61 (1) (2005) 55–79.
[178] B. Leibe, B. Schiele, Interleaved Object Categorization and Segmentation, in: British Machine Vision Conference, 2003.
[179] F. Li, J. Kosecka, H. Wechsler, Strangeness Based Feature Selection for Part Based Recognition, in: Computer Vision and Pattern Recognition, 2006.
[180] V. Ferrari, T. Tuytelaars, L. V. Gool, Simultaneous Object Recognition and Segmentation by Image Exploration, in: European Conference on Computer Vision, 2004.
[181] K. Siddiqi, A. Shokoufandeh, S. Dickinson, S. Zucker, Shock Graphs and Shape Matching, International Journal of Computer Vision 30 (1999) 1–24.
[182] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[183] B. Ommer, M. Sauter, J. Buhmann, Learning Top-Down Grouping of Compositional Hierarchies for Recognition, in: CVPR, 2006.
[184] B. Ommer, J. Buhmann, Learning the Compositional Nature of Visual Objects, in: CVPR, 2007.
[185] J. Deng, S. Satheesh, A. Berg, L. Fei-Fei, Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition, in: NIPS, 2011.

[186] J. Deng, A. Berg, L. F. Fei, Hierarchical Semantic Indexing for Large Scale Image Retrieval, in: CVPR, 2011.
[187] E. Bart, I. Porteous, P. Perona, M. Welling, Unsupervised Learning of Visual Taxonomies, in: CVPR, 2008.
[188] E. Bart, M. Welling, P. Perona, Unsupervised organization of image collections: taxonomies and beyond, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (11) (2011) 2302–2315.
[189] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, A. Ng, Building High-level Features Using Large Scale Unsupervised Learning, in: ICML, 2012.
[190] C. Lampert, M. Blaschko, A Multiple Kernel Learning Approach to Joint Multi-class Object Detection, in: DAGM, 2008.
[191] C. Lampert, M. Blaschko, T. Hofmann, Beyond Sliding Windows: Object Localization by Efficient Subwindow Search, in: CVPR, 2008.
[192] C. Lampert, Detecting Objects in Large Image Collections and Videos by Efficient Subimage Retrieval, in: ICCV, 2009.
[193] N. Pinto, D. Cox, J. DiCarlo, Why is Real-World Visual Object Recognition Hard?, PLoS Computational Biology 4 (1) (2008) 151–156.
[194] A. Torralba, A. Efros, Unbiased Look at Dataset Bias, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[195] R. Fergus, P. Perona, A. Zisserman, A Visual Category Filter for Google Images, in: European Conference on Computer Vision, 2004.
[196] A. Opelt, A. Pinz, A. Zisserman, Incremental Learning of Object Detectors using a Visual Shape Alphabet, in: Computer Vision and Pattern Recognition, 2006.
[197] F.-F. Li, R. Fergus, P. Perona, Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories, in: Computer Vision and Pattern Recognition Workshops, 2004.
[198] B. Leibe, B. Schiele, Scale-Invariant Object Categorization Using a Scale-Adaptive Mean-Shift Search, in: DAGM, 2004.
[199] B. Leibe, A. Leonardis, B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, in: ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[200] R. Fergus, L. Fei-Fei, P. Perona, A. Zisserman, Learning Object Categories from Google's Image Search, in: International Conference on Computer Vision, 2005.
[201] F. Jurie, B. Triggs, Creating Efficient Codebooks for Visual Recognition, in: International Conference on Computer Vision, 2005.
[202] M. Isard, PAMPAS: Real-Valued Graphical Models for Computer Vision, in: Computer Vision and Pattern Recognition, 2003.
[203] R. Fergus, P. Perona, A. Zisserman, Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition, International Journal of Computer Vision 71 (3) (2007) 273–303.
[204] E. Bienenstock, S. Geman, D. Potter, Compositionality, MDL Priors, and Object Recognition, in: NIPS, 1997.
[205] S.-C. Zhu, D. Mumford, A Stochastic Grammar of Images, Foundations and Trends in Computer Graphics and Vision 2 (4) (2007) 259–362.
[206] P. Laplace, Essai philosophique sur les probabilites, 1812.
[207] K. Fu, Syntactic Pattern Recognition and Applications, Prentice Hall, 1982.
[208] H. Blum, Biological Shape and Visual Science, J. Theoretical Biology 38 (1973) 207–285.
[209] M. Leyton, A process grammar for shape, Artificial Intelligence 34 (1988) 213–247.
[210] T. B. Sebastian, P. Klein, B. Kimia, Recognition of Shapes by Editing their Shock Graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (5) (2004) 550–571.
[211] M. Pelillo, K. Siddiqi, S. Zucker, Matching hierarchical structures using association graphs, IEEE PAMI 21 (11) (1999) 1105–1120.
[212] D. Macrini, A. Shokoufandeh, S. Dickinson, K. Siddiqi, S. Zucker, View-Based 3-D Object Recognition using Shock Graphs, in: Proc. International Conference on Pattern Recognition, 2002.
[213] F. Demirci, A. Shokoufandeh, Y. Keselman, L. Bretzner, S. Dickinson, Object Recognition as Many-to-Many Feature Matching, International Journal of Computer Vision.
[214] Y. Keselman, S. Dickinson, Generic Model Abstraction from Examples, IEEE Transactions on Pattern Analysis and Machine Intelligence: Special Issue on Syntactic and Structural Pattern Recognition 27 (7).
[215] A. Shokoufandeh, D. Macrini, S. Dickinson, K. Siddiqi, S. Zucker, Indexing Hierarchical Structures Using Graph Spectra, IEEE Transactions on Pattern Analysis and Machine Intelligence 27.
[216] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 (4) (1980) 193–202.
[217] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation 1 (4) (1989) 541–551.
[218] H. Wersing, E. Korner, Learning Optimized Features for Hierarchical Models of Invariant Object Recognition, Neural Computation 15 (7) (2003) 1559–1588.
[219] S. Fidler, G. Berginc, A. Leonardis, Hierarchical Statistical Learning of Generic Parts of Object Structure, in: Computer Vision and Pattern Recognition, 2006.
[220] I. Kokkinos, A. Yuille, HOP: Hierarchical Object Parsing, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[221] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A Novel Connectionist System for Unconstrained Handwriting Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5) (2009) 855–868.
[222] Y. LeCun, Y. Bengio, Convolutional Networks for Images, Speech, and Time-Series, in: The Handbook of Brain Theory and Neural Networks, MIT Press, 1995.
[223] T. Avraham, M. Lindenbaum, Dynamic Visual Search Using Inner-Scene Similarity: Algorithms and Inherent Limitations, in: ECCV, 2004.
[224] T. Avraham, M. Lindenbaum, Attention-Based Dynamic Visual Search Using Inner-Scene Similarity: Algorithms and Bounds, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2) (2006) 251–264.
[225] T. Avraham, M. Lindenbaum, Esaliency - A Stochastic Attention Model Incorporating Similarity Information and Knowledge-Based Preferences, in: International Workshop on the Representation and Use of Prior Knowledge in Vision, 2006.
[226] J. Duncan, G. Humphreys, Visual Search and Stimulus Similarity, Psychological Rev. 96 (1989) 433–458.
[227] P. Viola, M. Jones, Rapid Object Detection using a Boosted Cascade of Simple Features, in: Computer Vision and Pattern Recognition, 2001.
[228] B. A. Draper, J. Bins, K. Baek, ADORE: Adaptive Object Recognition, in: ICVS, 1999.
[229] L. Paletta, G. Fritz, C. Seifert, Cascaded sequential attention for object recognition with informative local descriptors and q-learning of grouping strategies, in: CVPR, 2005.

[230] L. Paletta, G. Fritz, C. Seifert, Q-learning of sequential attention for visual object recognition from informative local descriptors, in: ICML, 2005.
[231] C. Greindl, A. Goyal, G. Ogris, L. Paletta, Cascaded Attention and Grouping for Object Recognition from Video, in: ICIAP, 2003.
[232] C. Bandera, F. J. Vico, J. M. Bravo, M. E. Harmon, L. C. B. III, Residual Q-learning applied to visual attention, in: ICML, 1996.
[233] T. Darrell, Reinforcement Learning of Active Recognition Behaviors, in: NIPS, 1995.
[234] H. D. Tagare, K. Toyama, J. G. Wang, A Maximum-Likelihood Strategy for Directing Attention during Visual Search, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (5) (2001) 490–500.
[235] P. Viola, M. Jones, Robust Real-time Object Detection, in: Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing and Sampling, 2001.
[236] P. Viola, M. J. Jones, D. Snow, Detecting Pedestrians Using Patterns of Motion and Appearance, International Journal of Computer Vision 63 (2) (2005) 153–161.
[237] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, in: Computer Vision and Pattern Recognition, 2004.
[238] A. Opelt, A. Pinz, A. Zisserman, A boundary-fragment-model for object detection, in: European Conference on Computer Vision, 2006.
[239] A. Andreopoulos, J. K. Tsotsos, Active Vision for Door Localization and Door Opening Using Playbot: A Computer Controlled Wheelchair for People with Mobility Impairments, in: Proc. 5th Canadian Conference on Computer and Robot Vision, 2008.
[240] Y. Amit, D. Geman, A Computational Model for Visual Selection, Neural Computation 11 (1999) 1691–1715.
[241] J. H. Piater, Visual Feature Learning, Ph.D. thesis, University of Massachusetts Amherst, 2001.
[242] F. Fleuret, D. Geman, Coarse-to-fine face detection, International Journal of Computer Vision 41 (1-2) (2001) 85–107.
[243] J. Sullivan, A. Blake, M. Isard, J. MacCormick, Bayesian object localisation in images, Int. J. Computer Vision 44 (2) (2001) 111–135.
[244] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video content: The QBIC system, IEEE Computer 28 (9) (1995) 23–32.
[245] A. Gupta, R. Jain, Visual Information Retrieval, Communications of the ACM 40 (5) (1997) 70–79.
[246] S. Mukherjea, K. Hirata, Y. Hara, Amore: A World Wide Web image retrieval engine, in: Proc. International World Wide Web Conference, 1999.
[247] A. Pentland, R. Picard, S. Sclaroff, Photobook: Tools for content-based manipulation of image databases, in: Proc. of the Conference on Storage and Retrieval for Image and Video Databases, SPIE, 1994.
[248] J. Smith, S.-F. Chang, Visualseek: A fully automated content-based image query system, in: Proc. of the ACM International Conference on Multimedia, 1997.
[249] J. Wang, G. Wiederhold, O. Firschein, S. Wei, Content-Based image indexing and searching using Daubechies' wavelets, Int. J. Digital Libraries 1 (4) (1998) 311–328.
[250] W. Ma, B. Manjunath, Netra: A toolbox for navigating large image databases, in: Proc. IEEE International Conference on Image Processing, 1997.
[251] J. Laaksonen, M. Koskela, S. Laakso, E. Oja, PicSOM - content-based image retrieval with self-organizing maps, Pattern Recognition Letters 21 (2000) 1199–1207.
[252] T. Judd, F. Durand, A. Torralba, A Benchmark of Computational Models of Saliency to Predict Human Fixations, Tech. Rep. TR-2012-001, MIT-CSAIL, 2012.
[253] P. Zikopoulos, D. deRoos, K. P. Corrigan, Harness the Power of Big Data: The IBM Big Data Platform, McGraw-Hill, 2012.
[254] URL: www.comscore.com.
[255] A. Blaser, Database Techniques for Pictorial Applications, in: Lecture Notes in Computer Science, vol. 81, Springer Verlag, 1979.
[256] R. Jain, Visual Information Management Systems, in: Proc. US NSF Workshop, 1992.
[257] M. S. Lew, N. Sebe, C. Djeraba, R. Jain, Content-based multimedia information retrieval: State of the art and challenges, ACM Transactions on Multimedia Computing 2 (1) (2006) 1–19.
[258] R. Datta, D. Joshi, J. Li, J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Computing Surveys 40 (2) (2008) 1–60.
[259] R. C. Veltkamp, M. Tanase, Content-based image retrieval systems: A survey, Tech. Rep., Department of Computer Science, Utrecht University, 2002.
[260] D. Huijsmans, N. Sebe, How to complete performance graphs in content-based image retrieval: Add generality and normalize scope, IEEE Trans. Patt. Anal. Mach. Intell. 27 (2) (2005) 245–251.
[261] H. Tamura, S. Mori, T. Yamawaki, Texture features corresponding to visual perception, IEEE Transactions on Systems, Man and Cybernetics 8 (6) (1978) 460–473.
[262] A. Pentland, R. W. Picard, S. Sclaroff, Photobook: Content-based manipulation of image databases, International Journal of Computer Vision 18 (3) (1996) 233–254.
[263] J. Laaksonen, M. Koskela, E. Oja, PicSOM - Self-Organizing Image Retrieval With MPEG-7 Content Descriptors, IEEE Transactions on Neural Networks 13 (4) (2002) 841–853.
[264] V. Viitaniemi, J. Laaksonen, Techniques for Still Image Scene Classification and Object Detection, in: ICANN, 2006.
[265] V. Viitaniemi, J. Laaksonen, Techniques for Image Classification, Object Detection and Object Segmentation, in: Visual Information Systems. Web-Based Visual Information Search and Management, 2008.
[266] D. Wilkes, J. Tsotsos, Behaviours for Active Object Recognition, in: SPIE Conference, 225–239, 1993.
[267] S. D. Roy, S. Chaudhury, S. Banerjee, Isolated 3D Object Recognition through Next View Planning, IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 30 (1) (2000) 67–76.
[268] E. Dickmanns, URL: http://www.dyna-vision.de/.
[269] H. Meissner, E. Dickmanns, Control of an unstable plant by computer vision, in: Image Sequence Processing and Dynamic Scene Analysis, Springer-Verlag, Berlin, 532–548, 1983.

[270] E. Dickmanns, A. Zapp, Guiding Land Vehicles Along Roadways by Computer Vision, in: Proc. Congres Automatique, 1985.
[271] E. Dickmanns, A. Zapp, A curvature-based scheme for improving road vehicle guidance by computer vision, in: Proc. Mobile Robots, SPIE, 1986.
[272] B. Mysliwetz, E. Dickmanns, A vision system with active gaze control for real-time interpretation of well structured dynamic scenes, in: Proc. 1st Conference on Intelligent Autonomous Systems (IAS-1), 1986.
[273] E. Dickmanns, B. Mysliwetz, Recursive 3-D road and relative ego-state recognition, IEEE Trans. Patt. Anal. Mach. Intell. 14 (2) (1992) 199–213.
[274] M. Lutzeler, E. Dickmanns, Road recognition with MarVEye, in: Proc. Intern. Conf. on Intelligent Vehicles, 1998.
[275] F. Thomanek, E. Dickmanns, D. Dickmanns, Multiple Object Recognition and Scene Interpretation for Autonomous Road Vehicle Guidance, in: Proc. of Int. Symp. on Intelligent Vehicles, 1994.
[276] J. Schick, E. Dickmanns, Simultaneous Estimation of 3-D Shape and Motion of Objects by Computer Vision, in: IEEE Workshop on Visual Motion, 1991.
[277] S. Werner, S. Furst, D. Dickmanns, E. Dickmanns, A vision-based multi-sensor machine perception system for autonomous aircraft landing approach, in: Enhanced and Synthetic Vision AeroSense, 1996.
[278] S. Furst, S. Werner, D. Dickmanns, E. Dickmanns, Landmark navigation and autonomous landing approach with obstacle detection for aircraft, in: Proc. AeroSense, 1997.
[279] F. Callari, F. Ferrie, Active Recognition: Looking for Differences, Int. J. of Comput. Vision 43 (3) (2001) 189–204.
[280] S. Dickinson, H. Christensen, J. Tsotsos, G. Olofsson, Active Object Recognition Integrating Attention and Viewpoint Control, Comput. Vis. and Image Und. 67 (3) (1997) 239–260.
[281] B. Schiele, J. Crowley, Transinformation for Active Object Recognition, in: Proc. Int. Conf. on Computer Vision, 1998.
[282] H. Borotschnig, L. Paletta, M. Prantl, A. Pinz, Active Object Recognition in Parametric Eigenspace, in: Proc. British Machine Vision Conference, 629–638, 1998.
[283] L. Paletta, M. Prantl, Learning Temporal Context in Active Object Recognition Using Bayesian Analysis, in: International Conference on Pattern Recognition, 2000.
[284] S. D. Roy, S. Chaudhury, S. Banerjee, Recognizing Large 3D Objects through Next View Planning using an Uncalibrated Camera, in: Proc. ICCV, 2001.
[285] S. D. Roy, N. Kulkarni, Active 3D Object Recognition using Appearance Based Aspect Graphs, in: Proc. ICVGIP, 40–45, 2004.
[286] S. A. Hutchinson, A. Kak, Planning Sensing Strategies in a Robot Work Cell with Multi-Sensor Capabilities, IEEE Transactions on Robotics and Automation 5 (6) (1989) 765–783.
[287] K. Gremban, K. Ikeuchi, Planning Multiple Observations for Object Recognition, International Journal of Computer Vision 12 (2/3) (1994) 137–172.
[288] S. Herbin, Recognizing 3D Objects by Generating Random Actions, in: CVPR, 1996.
[289] S. Kovacic, A. Leonardis, F. Pernus, Planning Sequences of Views for 3D Object Recognition and Pose Determination, Pattern Recognition 31 (10) (1998) 1407–1417.
[290] J. Denzler, C. M. Brown, Information Theoretic Sensor Data Selection for Active Object Recognition and State Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2) (2002) 145–157.
[291] C. Laporte, T. Arbel, Efficient Discriminant Viewpoint Selection for Active Bayesian Recognition, International Journal of Computer Vision 68 (3) (2006) 267–287.
[292] A. K. Mishra, Y. Aloimonos, Active Segmentation, International Journal of Humanoid Robotics 6 (3) (2009) 361–386.
[293] A. K. Mishra, Y. Aloimonos, C. Fermuller, Active Segmentation for Robotics, in: IROS, 2009.
[294] X. Zhou, D. Comaniciu, A. Krishnan, Conditional Feature Sensitivity: A Unifying View on Active Recognition and Feature Selection, in: ICCV, 2003.
[295] Microsoft Kinect, URL: http://www.xbox.com/en-us/kinect.
[296] J. Tang, S. Miller, A. Singh, P. Abbeel, A textured object recognition pipeline for color and depth image data, in: ICRA, 2012.
[297] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light sensor, in: Proc. ICCV Workshops, 2011.
[298] L. Xia, C.-C. Chen, J. Aggarwal, Human detection using depth information by Kinect, in: Proc. Computer Vision and Pattern Recognition Workshops, 2011.
[299] K. Lai, L. Bo, X. Ren, D. Fox, Sparse Distance Learning for Object Recognition Combining RGB and Depth Information, in: Proc. ICRA, 2011.
[300] F. Callari, F. Ferrie, Active Recognition: Using Uncertainty to reduce ambiguity, in: Proc. ICPR, 1996.
[301] L. Paletta, A. Pinz, Active object recognition by view integration and reinforcement learning, Robotics and Autonomous Systems 31 (2000) 71–86.
[302] R. D. Rimey, C. M. Brown, Control of Selective Perception Using Bayes Nets and Decision Theory, Int. J. Comput. Vision 12 (2/3) (1994) 173–207.
[303] L. E. Wixson, D. H. Ballard, Using Intermediate Objects to Improve the Efficiency of Visual Search, Int. J. Comput. Vision 12 (2/3) (1994) 209–230.
[304] K. Sjoo, A. Aydemir, P. Jensfelt, Topological spatial relations for active visual search, Robotics and Autonomous Systems 60 (9) (2012) 1093–1107.
[305] K. Brunnstrom, T. Lindeberg, J.-O. Eklundh, Active Detection and Classification of Junctions by Foveation with a Head-Eye System Guided by the Scale-Space Primal Sketch, in: ECCV, 1992.
[306] K. Brunnstrom, J.-O. Eklundh, T. Uhlin, Active fixation for scene exploration, International Journal of Computer Vision 17 (2) (1996) 137–162.
[307] Y. Ye, J. Tsotsos, Sensor Planning for 3D Object Search, Computer Vision and Image Understanding 73 (2) (1999) 145–168.
[308] S. Minut, S. Mahadevan, A Reinforcement Learning Model of Selective Visual Attention, in: International Conference on Autonomous Agents, 2001.

[309] T. Kawanishi, H. Murase, S. Takagi, Quick 3D object detection and localization by dynamic active search with multiple active cameras, in: International Conference on Pattern Recognition, 2002.
[310] S. Ekvall, P. Jensfelt, D. Kragic, Integrating Active Mobile Robot Object Recognition and SLAM in Natural Environments, in: Proc. Intelligent Robots and Systems, 2006.
[311] D. Meger, P. Forssen, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. Little, D. Lowe, Curious George: An attentive semantic robot, in: Proc. Robot. Auton. Syst., 2008.
[312] P. Forssen, D. Meger, K. Lai, S. Helmer, J. Little, D. Lowe, Informed visual search: Combining attention and object recognition, in: Proc. IEEE International Conference on Robotics and Automation, 2008.
[313] F. Saidi, O. Stasse, K. Yokoi, F. Kanehiro, Online Object Search with a Humanoid Robot, in: Proc. Intelligent Robots and Systems, 2007.
[314] H. Masuzawa, J. Miura, Observation Planning for Efficient Environment Information Summarization, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 5794–5800, 2009.
[315] K. Sjoo, D. G. Lopez, C. Paul, P. Jensfelt, D. Kragic, Object Search and Localization for an Indoor Mobile Robot, Journal of Computing and Information Technology 1 (2009) 67–80.
[316] J. Ma, T. H. Chung, J. Burdick, A probabilistic framework for object search with 6-DOF pose estimation, The International Journal of Robotics Research 30 (10) (2011) 1209–1228.
[317] K. Ozden, K. Schindler, L. V. Gool, Multibody Structure-from-Motion in Practice, PAMI 32 (6) (2010) 1134–1141.
[318] A. Yarbus, Eye Movements and Vision, Plenum, New York, 1967.
[319] D. Bruckner, M. Vincze, I. Hinterleitner, Towards Reorientation with a Humanoid Robot, Leveraging Applications of Formal Methods, Verification and Validation (2012) 156–161.
[320] J.-K. Yoo, J.-H. Kim, Fuzzy Integral-Based Gaze Control Architecture Incorporated With Modified-Univector Field-Based Navigation for Humanoid Robots, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (1) (2012) 125–139.
[321] J. Malik, Interpreting Line Drawings of Curved Objects, International Journal of Computer Vision 1 (1) (1987) 73–104.
[322] A. Andreopoulos, Active Object Recognition in Theory and Practice, Ph.D. thesis, York University, January 2011.
[323] J. Najemnik, W. S. Geisler, Optimal eye movement strategies in visual search, Nature 434 (2005) 387–391.
[324] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes (VOC) Challenge, Int. J. Comput. Vision 88 (2).
[325] L. Fei-Fei, R. Fergus, P. Perona, Caltech 101 dataset, URL: http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html.
[326] G. Griffin, A. Holub, P. Perona, Caltech 256 dataset, URL: http://www.vision.caltech.edu/Image_Datasets/Caltech256/.
[327] A. F. Smeaton, P. Over, W. Kraaij, TRECVID dataset, URL: http://www-nlpir.nist.gov/projects/trecvid/.
[328] C. G. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, A. W. Smeulders, The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia, in: Proceedings of ACM Multimedia, 2006.
[329] B. Yao, X. Yang, S. Zhu, The Lotus Hill dataset, URL: http://www.imageparsing.com/.
[330] M. Sanderson, P. Clough, H. Muller, J. Kalpathy-Cramer, M. Ruiz, D. D. Fushman, S. Nowak, J. Liebetrau, T. Tsikrika, J. Kludas, A. Popescu, H. Goeau, A. Joly, ImageCLEF dataset, URL: http://www.imageclef.org/.
[331] S. A. Nene, S. K. Nayar, H. Murase, COIL-100 dataset, URL: http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php.
[332] B. Leibe, B. Schiele, The ETH-80 dataset, URL: http://www.mis.informatik.tu-darmstadt.de/Research/Projects/categorization/eth80-db.html.
[333] J. Willamowski, D. Arregui, G. Csurka, C. Dance, L. Fan, Categorizing nine visual classes using local appearance descriptors, in: ICPR Workshop on Learning for Adaptive Visual Systems, 2004.
[334] I. Laptev, T. Lindeberg, KTH action dataset, URL: http://www.nada.kth.se/cvap/actions/, 2004.
[335] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: Computer Vision and Pattern Recognition, 2005.
[336] A. Opelt, M. Fussenegger, A. Pinz, P. Auer, Weak hypotheses and boosting for generic object detection and recognition, in: European Conference on Computer Vision, 2004.
[337] B. Russell, A. Torralba, K. Murphy, W. T. Freeman, LabelMe: a database and web-based tool for image annotation, International Journal of Computer Vision, URL: http://labelme2.csail.mit.edu/Release3.0/browserTools/php/dataset.php, 2007.
[338] A. Torralba, R. Fergus, W. Freeman, 80 million tiny images: a large dataset for non-parametric object and scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (11), URL: http://groups.csail.mit.edu/vision/TinyImages/, 2008.
[339] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, IEEE Computer Vision and Pattern Recognition, URL: http://www.image-net.org/, 2009.
[340] B. Yao, X. Jiang, A. Khosla, A. Lin, L. Guibas, L. Fei-Fei, Human Action Recognition by Learning Bases of Action Attributes and Parts, International Conference on Computer Vision, URL: http://vision.stanford.edu/Datasets/40actions.html, 2011.
[341] A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, The DET curve in assessment of detection task performance, in: 5th European Conference on Speech Communication and Technology, 1997.
[342] M. Tahir, J. Kittler, K. Mikolajczyk, F. Yan, K. van de Sande, T. Gevers, Visual Category Recognition Using Spectral Regression and Kernel Discriminant Analysis, in: IEEE International Conference on Computer Vision Workshops, 2009.
[343] R. Kasturi, D. B. Goldgof, P. Soundararajan, V. Manohar, J. S. Garofolo, R. Bowers, M. Boonstra, V. N. Korzhova, J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 319–336.
[344] V. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, D. Doermann, T. Drayer, Performance Evaluation of Object Detection Algorithms, in: International Conference on Pattern Recognition, 2002.
[345] D. Doermann, D. Mihalcik, Tools and Techniques for Video Performance Evaluation, in: International Conference on Pattern Recognition, 2000.

[346] MIT, Frontiers in Computer Vision, URL: http://www.frontiersincomputervision.com/, 2011.
[347] J. P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine 2 (8) (2005) 696–701.
[348] L. L. Zhu, Y. Chen, A. Yuille, W. Freeman, Latent Hierarchical Structural Learning for Object Detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[349] Y. Chen, L. L. Zhu, A. Yuille, Active Mask Hierarchies for Object Detection, in: ECCV, 2010.
[350] Z. Song, Q. Chen, Z. Huang, Y. Hua, S. Yan, Contextualizing Object Detection and Classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[351] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, A. W. M. Smeulders, Segmentation as Selective Search for Object Recognition, in: IEEE International Conference on Computer Vision, 2011.
[352] L. Bourdev, J. Malik, Poselets: Body part detectors trained using 3D human pose annotations, in: IEEE 12th International Conference on Computer Vision, 2009.
[353] L. Bourdev, S. Maji, T. Brox, J. Malik, Detecting People Using Mutually Consistent Poselet Activations, in: ECCV, 2010.
[354] F. Perronnin, J. Sanchez, T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, in: ECCV, 2010.
[355] Q. Chen, Z. Song, S. Liu, X. Chen, X. Yuan, T.-S. Chua, S. Yan, Y. Hua, Z. Huang, S. Shen, Boosting Classification with Exclusive Context, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf.
[356] A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple kernels for object detection, in: IEEE International Conference on Computer Vision, 2009.
[357] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained Linear Coding for Image Classification, in: Computer Vision and Pattern Recognition, 2010.
[358] F. S. Khan, J. van de Weijer, M. Vanrell, Top-Down Color Attention for Object Recognition, in: IEEE International Conference on Computer Vision, 2009.
[359] F. S. Khan, J. van de Weijer, M. Vanrell, Modulating Shape Features by Color Attention for Object Recognition, Int. J. Comput. Vision.
[360] H. Harzallah, C. Schmid, F. Jurie, A. Gaidon, Classification aided two stage localization, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/workshop/harzallah.pdf.
[361] H. Harzallah, F. Jurie, C. Schmid, Combining efficient object localization and image classification, in: IEEE International Conference on Computer Vision, 2009.
[362] M. A. Tahir, K. van de Sande, J. Uijlings, F. Yan, X. Li, K. Mikolajczyk, J. Kittler, T. Gevers, A. Smeulders, UvA & Surrey @ PASCAL VOC 2008, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/workshop/tahir.pdf.
[363] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9) (2010) 1627–1645.
[364] F. Perronnin, C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[365] O. Chum, A. Zisserman, An Exemplar Model for Learning Object Classes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[366] P. Felzenszwalb, D. McAllester, D. Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[367] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of Adjacent Contour Segments for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (1) (2008) 36–51.
[368] J. van de Weijer, C. Schmid, Coloring Local Feature Extraction, in: ECCV, 2006.
[369] M. Everingham, A. Zisserman, C. Williams, L. V. Gool, The Pascal Visual Object Classes Challenge 2006 (VOC2006) Results, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2006/results.pdf.
[370] M. Everingham, L. V. Gool, C. Williams, A. Zisserman, Pascal Visual Object Classes Challenge Results for 2005, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2005/results.pdf.
[371] M. Everingham, L. V. Gool, C. Williams, A. Zisserman, Pascal Visual Object Classes Challenge Website, URL: http://pascallin.ecs.soton.ac.uk/challenges/VOC/.
[372] T. Lindeberg, Feature detection with automatic scale selection, International Journal of Computer Vision 30 (2) (1998) 79–116.
[373] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[374] S. Lazebnik, C. Schmid, J. Ponce, A sparse texture representation using local affine regions, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1265–1278.
[375] Y. Rubner, C. Tomasi, L. Guibas, The Earth Mover's distance as a metric for image retrieval, International Journal of Computer Vision 40 (2) (2000) 99–121.
[376] J. van de Weijer, C. Schmid, Applying color names to image description, in: ICIP, 2007.
[377] J. van de Weijer, C. Schmid, J. Verbeek, Learning color names from real-world images, in: CVPR, 2007.
[378] K. van de Sande, T. Gevers, C. Snoek, Evaluation of color descriptors for object and scene recognition, in: CVPR, 2008.
[379] G. Moore, Cramming more Components onto Integrated Circuits, Electronics 38 (8).
[380] W. Arden, M. Brillouet, P. Cogez, M. Graef, B. Huizing, R. Mahnkopf, More than Moore White Paper by the IRC, Tech. Rep., International Technology Roadmap for Semiconductors, 2010.
[381] D. Engelbart, Microelectronics and the art of similitude, in: Proc. IEEE International Solid-State Circuits Conference, 1960.
[382] J. Markoff, It's Moore's Law, but Another Had the Idea First, New York Times, April 18, 2005, URL: www.nytimes.com/2005/04/18/technology/18moore.html.
[383] The Human Brain Project: A Report to the European Commission, 2012.
[384] International Technology Roadmap for Semiconductors, URL: www.itrs.net, 2011.
[385] R. Preissl, T. M. Wong, P. Datta, M. Flickner, R. Singh, S. K. Esser, W. P. Risk, H. D. Simon, D. S. Modha, Compass: A scalable simulator for an architecture for Cognitive Computing, in: Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.

[386] A. M. Turing, Computing Machinery and Intelligence, Mind 59 (1950) 433–460.
[387] P. Churchland, P. Churchland, Could a Machine Think?, Scientific American 262 (1) (1990) 32–37.
[388] J. Hawkins, On Intelligence, Times Books, 2004.
[389] Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative, URL: http://www.nih.gov/science/brain/.
[390] SpiNNaker, URL: http://apt.cs.man.ac.uk/projects/SpiNNaker/.
[391] FACETS: Fast Analog Computing with Emergent Transient States, URL: http://facets.kip.uni-heidelberg.de/index.html.
[392] IFAT 4G, URL: http://etienne.ece.jhu.edu/projects/ifat/index.html.
[393] NEUROGRID, Stanford University, URL: http://www.stanford.edu/group/brainsinsilicon/neurogrid.html.
[394] Brain Corporation, URL: http://www.braincorporation.com/.
[395] DARPA Neovision2 project, URL: www.darpa.mil/Our Work/DSO/Programs/Neovision2.aspx.
[396] DARPA SyNAPSE project, URL: www.darpa.mil/Our Work/DSO/Programs/Systems of Neuromorphic Adaptive Plastic Scalable Electronics (SYNAPSE).aspx.
[397] IBM Cognitive Computing, URL: www.ibm.com/smarterplanet/us/en/business analytics/article/cognitive computing.html.
[398] International Technology Roadmap for Semiconductors 2011 Edition: Emerging Research Devices, URL: www.itrs.net/Links/2011ITRS/2011Chapters/2011ERD.pdf, 2011.
[399] Y. Sugita, Face perception in monkeys reared with no exposure to faces, PNAS 105 (1) (2008) 394–398.
[400] V. Mountcastle, The Mindful Brain, chap. An Organizing Principle for Cerebral Function: The Unit Model and the Distributed System, MIT Press, 1978.
[401] R. Kurzweil, How To Create a Mind: The Secret of Human Thought Revealed, Viking Penguin, 2012.

Alexander Andreopoulos received an Honours B.Sc. degree (2003) in Computer Science and Mathematics, with High Distinction, from the University of Toronto. In 2005 he received his M.Sc. degree and in January 2011 he completed his Ph.D. degree, both in Computer Science at York University, Toronto, Canada. During 2011 he worked on the DARPA Neovision2 project. Since January 2012 he has been a researcher at IBM-Almaden, working on the DARPA-SyNAPSE/Cognitive-Computing project. He has received the DEC award for the most outstanding student in Computer Science to graduate from the University of Toronto, a SONY science scholarship, NSERC PGS-M/PGS-D scholarships, and a best paper award.

John K. Tsotsos received his Ph.D. in 1980 from the University of Toronto. He was on the faculty of Computer Science at the University of Toronto from 1980 to 1999. He then moved to York University, where he served as Director of York's Centre for Vision Research (2000-2006) and is currently Distinguished Research Professor of Vision Science in the Dept. of Computer Science & Engineering. He is Adjunct Professor in both Ophthalmology and Computer Science at the University of Toronto. Dr. Tsotsos has published many scientific papers, with six conference papers receiving recognition. He currently holds the NSERC Tier I Canada Research Chair in Computational Vision and is a Fellow of the Royal Society of Canada. He has served on the editorial boards of Image & Vision Computing, Computer Vision and Image Understanding, Computational Intelligence, and Artificial Intelligence in Medicine, and on many conference committees. He served as General Chair of the IEEE International Conference on Computer Vision in 1999.
