Ph.D. Research Proposal:
Learning Object Recognition Models from Images
Art Pope
Advisor: David Lowe
Department of Computer Science
University of British Columbia
January 1994
Abstract
To recognize an object in an image one must have an internal model of how that object may appear. This proposal describes a method for learning such models from training images. The proposed method models an object by a probability distribution that describes the range of possible variation in the object's appearance. This probability distribution is organized on two levels. Large variations are handled by partitioning the training images into clusters that may correspond to distinctly different views of the object. Within each cluster, smaller variations are represented by probability distributions that characterize the presence, position, and measurements of various discrete features of appearance. The learning process combines an incremental conceptual clustering algorithm for forming the clusters with a generalization algorithm for consolidating each cluster's training images into a single description. Recognition, which involves matching the features of a training image with those of a cluster's description, can employ information about feature positions, numeric measurements, and relations in order to constrain and speed the search. Preliminary experiments have been conducted with a system that implements some aspects of the proposed method; the system can learn to recognize a single characteristic view of an object in the presence of occlusion and clutter.
Contents
1. Introduction
2. Related Research
   2.1 Representation
       2.1.1 Primitives
       2.1.2 Coordinate system
       2.1.3 Organization
       2.1.4 Representing object models
   2.2 Recognition
       2.2.1 Perceptual organization
       2.2.2 Indexing
       2.2.3 Matching
       2.2.4 Verifying
   2.3 Learning to recognize objects
       2.3.1 Learning an appearance representation
       2.3.2 Learning an appearance classifier
       2.3.3 Learning a structural model
       2.3.4 Learning a set of characteristic views
       2.3.5 Learning a recognition strategy
       2.3.6 Theoretical learnability
3. Approach
   3.1 Image representation
       3.1.1 Concepts
       3.1.2 Implementation
   3.2 Model representation
       3.2.1 Use of characteristic views
       3.2.2 Representation of a characteristic view
   3.3 Coordinate systems
   3.4 Recognition procedure
       3.4.1 Match quality measure
       3.4.2 Probability density estimation
       3.4.3 Matching procedure
       3.4.4 Verification procedure
   3.5 Model learning procedure
       3.5.1 Clustering training images into characteristic views
       3.5.2 Generalizing training images to form a model graph
       3.5.3 Consolidating a model graph
   3.6 Summary
4. Empirical Evaluation
   4.1 Preliminary results
   4.2 Analysis
       4.2.1 Better low-level features and verification needed
       4.2.2 Better high-level features needed
       4.2.3 Position information used more effectively to constrain matching
   4.3 Proposed evaluation
5. Plan
References
1. Introduction
To recognize an object in an image one must have some expectation of how the object may appear. That expectation is based on an internal model of the object's form or appearance. The research proposed here seeks to determine how one might construct a system that can acquire such models of objects directly from intensity images, and then recognize those objects using the models thus acquired. This system would be capable of learning to recognize objects.
The following scenario illustrates how this recognition learning system would operate. One presents to the
system a series of example images, each depicting a particular object from various viewpoints. From those examples the system develops a model of the object's appearance. This training is repeated for each of the objects the system is to recognize. When presented with a test image, the system can then identify and rank apparent instances of the known objects in the image.
Few object recognition systems have been designed to acquire their models directly from intensity images. Instead, most systems are simply given models in the form of manually-produced shape descriptions.
During recognition, in order to determine whether some object can have a particular appearance, the object's shape description must be combined with a model of the imaging process that accounts for the sensor, projection, surface reflectance, and illumination. Therefore, successful recognition depends on having good models not only of the objects, but also of the image formation process. Because modeling that process has proven difficult, most object recognition systems have been restricted to object models that are relatively simple and coarse.
A system that can learn its models directly from images, on the other hand, will enjoy three important advantages.
The learning system will avoid the problem of having to estimate the actual appearance of an object from some idealized model of its shape. Instead, all of the properties of the object's appearance needed for recognition will be learned directly by observation. Eliminating the inaccuracies of appearance estimation may allow recognition to be accomplished more accurately and robustly.
The learning system will acquire new models more conveniently. A new model will be defined not by measuring and encoding its shape, as with traditional object recognition systems, but by merely displaying the object to the system in various poses.
The learning system could provide a robot with the ability to learn objects when they are first encountered and then recognize those objects elsewhere. This ability would be important for a robot roaming in a dynamic and unknown environment.
The difficulty of the recognition learning problem is largely due to the simple fact that an object's appearance can vary substantially. It varies with changes in camera position and lighting and, if the object is flexible, with changes in shape. Further variation is due to noise, and to differences among individual instances of the same type of object. Accommodating this variation will be a central theme in the design of a recognition learning system. The scheme used to represent image content must be stable so that whenever variation is small, changes in representation are likely to be small also. The scheme used to model an object's appearance must describe just what variations are possible. The learning procedure must generalize enough to overcome insignificant variation, but not so much as to confuse dissimilar objects. And the procedure used to identify modeled objects in images must tolerate the likely range of mismatch between model and image.
In investigating the recognition learning problem I propose to concentrate on one particular version of it:
learning to recognize 3D objects in 2D intensity images. Objects will be recognized solely by the intensity edges they produce (although nothing about the approach I propose here seems to preclude extending it to use other properties too, such as colour and texture). As for the objects themselves, they may be entirely rigid, composed of a small number of articulations, or somewhat flexible. The system should be able to learn to recognize a class of similar objects, accommodating the variation among objects just as it accommodates the variation among images of a single object. This formulation of the recognition learning problem seems general enough to involve all important aspects of the problem and yield a genuinely useful system, yet restrictive enough to permit a solution to be found within the limits of a Ph.D. program.
The proposed method models an object by a probability distribution that describes the range of possible variation in the object's appearance. This probability distribution is organized on two levels. Large variations are handled by partitioning the training images into clusters that may correspond to distinctly different views of the object, or to different configurations of its parts. Within each cluster, smaller variations are represented by probability distributions characterizing various, discrete appearance features. For each feature we represent the probability of detecting that feature in various positions with respect to the overall object, and the probability of the feature having various values of feature-specific, numeric measurements. A rich, partially-redundant, and extensible repertoire of features is used to describe appearance.
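As a concrete illustration of the within-cluster feature distributions just described, the sketch below models a single feature by a detection probability together with Gaussian densities over a scalar position and a scalar measurement. The class name, the Gaussian choice, and the one-dimensional position are assumptions made for illustration; the proposal itself does not commit to these particular forms.

```python
import math

class FeatureModel:
    """Hypothetical per-feature model within one cluster (characteristic
    view): a detection probability, a Gaussian over position, and a
    Gaussian over one feature-specific measurement."""

    def __init__(self, p_detect, pos_mean, pos_var, meas_mean, meas_var):
        self.p_detect = p_detect          # probability the feature is found at all
        self.pos_mean, self.pos_var = pos_mean, pos_var
        self.meas_mean, self.meas_var = meas_mean, meas_var

    @staticmethod
    def _gauss(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def likelihood(self, position, measurement):
        """Density of observing the feature at `position` with `measurement`."""
        return (self.p_detect
                * self._gauss(position, self.pos_mean, self.pos_var)
                * self._gauss(measurement, self.meas_mean, self.meas_var))

f = FeatureModel(p_detect=0.9, pos_mean=10.0, pos_var=4.0,
                 meas_mean=1.0, meas_var=0.25)
# Likelihood is highest at the means and falls off with distance.
assert f.likelihood(10.0, 1.0) > f.likelihood(14.0, 1.0)
```

A full model would hold one such distribution per feature per cluster, and a match score would combine them across all matched features.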
The proposed learning method combines an incremental conceptual clustering algorithm for forming the clusters with a generalization algorithm for consolidating each cluster's training images into a single description. Recognition, which involves matching the features of a training image with those of a cluster's
description, can employ information about feature positions, numeric measurements, and relations in order
to constrain and speed the search for a match. Preliminary experiments have been conducted with a system
that implements some aspects of the proposed method; that system can learn to recognize a single view of
an object among other occluding and distracting objects.
In this introduction I have described the problem to be considered, its significance, the source of its diffi-
culty, and the outline of a solution. Chapter 2 discusses relevant work by others on this and similar prob-
lems. Chapter 3 describes the approach I propose for solving the problem. Some aspects of the approach
have already been implemented and tested, as reported in chapter 4. Chapter 5 presents a plan for complet-
ing the investigation.
2. Related Research
Before reviewing related research on learning to recognize objects we will survey some more general aspects of object recognition. The general survey is organized as two sections: the first on the representation of object models and images, and the second on finding suitable matches between them. These two sections cover a large subject rather quickly while focusing on topics particularly relevant to the research proposed here. A more thorough review of object recognition has been prepared separately (Pope 1994).
The two sections of general survey are followed by a third specifically on the use of learning in object recognition. Since there has been relatively little work done on this aspect of object recognition, the third section can cover its topic in somewhat greater detail than the preceding two sections, presenting a fairly complete survey of the role of learning in object recognition.
2.1 Representation
Representation schemes are needed both for object models and for image contents. Since these two schemes must be closely related to facilitate matching, much of the discussion that follows pertains to both.
2.1.1 Primitives
Various criteria favour a representation that employs localized primitives having a variety of types and
scales. These primitives should be tailored to their application in two respects. First, they should make
explicit any information needed to recognize and distinguish objects. Second, they must employ natural regularities (e.g., smoothness) to achieve stability. Commonly used for representing the shape of 2D intensity edges are segments of algebraic curves (including straight lines as well as higher-order curves, such as ellipses and general conics); parts or regions (e.g., ribbons formed by the projection of generalized cylinders); and various distinguished features such as significant changes in curvature or an application-specific set (e.g., eye, nose, and mouth templates).
For recognition it is desirable to use primitives that are somewhat invariant with respect to changes in pose,
lighting, and other confounding factors. Since it has been shown by Clemens and Jacobs (1991), among others, that there is no general invariant suitable for recognizing 3D objects in 2D images (i.e., no property of a general arrangement of 3D points that is invariant under projectivities), attention has focused on finding approximate invariants (e.g., Ben-Arie 1990), and invariants for suitably constrained structures (e.g., coplanar sets of points; Lamdan, Schwartz, and Wolfson 1988).
In most recognition work the repertoire of primitives has been finite and predetermined. However, achieving good recognition performance across numerous objects will probably require some ability to learn new types of primitives as needed (i.e., feature construction). A preliminary example of such learning has been given by Segen (1989), whose system learns primitives that represent useful configurations of distinguished points (see section 2.3.1). Also, whereas objects have usually been described by a relatively small number of primitives (perhaps a few dozen), recognition under widely varying conditions will require the use of much richer representations that have useful levels of redundancy.
2.1.2 Coordinate system
A scheme for representing object shape may use a coordinate system that is either object-centred (explicitly recording 3D structure) or viewer-centred (recording a series of 2D views, or characteristic views). Both have been used extensively for recognition. An object-centred representation will describe an object more accurately and concisely, but the description cannot be generated easily using intensity images acquired from unknown viewpoints. A viewer-centred description, by contrast, can be created by clustering views, as some researchers have done when creating viewer-centred descriptions from object-centred ones (e.g., Pathak and Camps 1993). The viewer-centred representation's accuracy is enhanced, and its storage requirements reduced, by using primitives that remain relatively stable over a range of viewpoints.
2.1.3 Organization
A shape representation will often organize primitives according to some combination of part/whole relations, degrees of descriptive accuracy, and adjacency relations. These organizations are frequently formalized as attributed graph structures in which nodes denote primitives, arcs denote their relations, and both nodes and arcs have attributes that record measurements (e.g., Wong, Lu, and Rioux 1989).
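A minimal sketch of such an attributed graph, with nodes for primitives and arcs for their relations, both carrying measurement attributes. The dictionary-based layout and the attribute names are assumptions for illustration, not the representation of any cited system.

```python
class AttributedGraph:
    """Nodes denote primitives; arcs denote relations between them;
    both carry attributes recording measurements."""

    def __init__(self):
        self.nodes = {}   # node id -> {"type": ..., measurement attributes}
        self.arcs = {}    # (id_a, id_b) -> {"relation": ..., attributes}

    def add_node(self, nid, type_, **attrs):
        self.nodes[nid] = {"type": type_, **attrs}

    def add_arc(self, a, b, relation, **attrs):
        self.arcs[(a, b)] = {"relation": relation, **attrs}

# A toy description: two line segments meeting at a corner.
g = AttributedGraph()
g.add_node(0, "line", length=12.4, orientation=0.0)
g.add_node(1, "line", length=8.1, orientation=1.57)
g.add_arc(0, 1, "adjacent", angle=1.57)
assert g.arcs[(0, 1)]["relation"] == "adjacent"
```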
2.1.4 Representing object models
Modeling an object that has a varying or uncertain appearance (e.g., a flexible object) requires some way of representing the possible range of variation. For articulate objects, one approach has been to decompose the object into parts whose relative positions are described by free variables having specified ranges (e.g., Brooks 1981). More general deformations can be modeled by an attributed graph in which attributes are
characterized by probability distributions (e.g., Wong and You 1985). In addition, an object model may include information about the reliability of various object features and the expected cost of detecting them in images (e.g., Kuno, Okamoto, and Okada 1991). These factors can be used to evaluate evidence of the object's presence in an image, and to optimize the strategy for collecting that evidence. It has proven difficult to estimate, from analytic models of lighting, sensor, and surface characteristics, such things as the reliability and variation of various object features (Chen and Mulgaonkar 1992). Methods that learn object models directly from images seek to bypass that difficulty.
2.2 Recognition
Recognizing an object in an image is generally accomplished in four steps: perceptual organization, in which interesting arrangements of features are identified in the image; indexing, in which likely candidates are selected from a library of models; matching, in which an optimal match is found between features of the image and those of a model; and verification, in which a decision is made about whether the optimal match represents an instance of the object in the image. These four steps are considered individually in the following subsections. An important research goal has been to minimize search, which is substantially involved in each step of the recognition process.
2.2.1 Perceptual organization
This step takes simple image features (e.g., edgels or straight lines) and groups them to form larger features that are more descriptive and, consequently, more useful for indexing. Grouping is based not on knowledge of specific objects, but rather on general assumptions that are meant to hold for most objects and situations encountered, e.g., the assumption that most objects are composed of a small number of convex parts (Jacobs 1989). The goal is to form groups that are likely to come from a single object, yet unlikely to arise accidentally due to a chance arrangement of objects (Lowe 1985). To be useful for recognition, however, a grouping method need only be partially successful since spurious groups will be eliminated by subsequent stages. With some notable exceptions (Lowe 1985; Sarkar and Boyer 1993), grouping methods are usually ad hoc, controlled by numerous hand-picked parameters, and entirely bottom-up (e.g., Bergevin and Levine 1992).
2.2.2 Indexing
Features found in the image are used to identify likely models, and perhaps also starting points for
matching those models to the image. Frequently used are hash tables, which identify model hypotheses
from hashed feature measurements (e.g., Lamdan, Schwartz, and Wolfson 1988). Radial basis functions have been used to map feature measurements to a series of graded yes/no responses, one per model (Brunelli and Poggio 1991); they do not appear to provide any advantage over hash tables, however. Hierarchical indices, which organize models according to an abstraction hierarchy, have been studied by few researchers (e.g., Sengupta and Boyer 1993), but they appear promising for reducing overall search.
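The hash-table indexing idea can be sketched as follows: each model registers quantized feature measurements offline, and image features then vote for candidate models. The bin size, the scalar measurements, and the toy model library are all assumptions; true geometric hashing (as in Lamdan et al.) hashes viewpoint-invariant coordinates rather than raw measurements.

```python
from collections import defaultdict

def quantize(measurement, bin_size=0.5):
    """Hash key from a scalar feature measurement (bin size is assumed)."""
    return round(measurement / bin_size)

# Build the index offline: each model registers its features' hash bins.
index = defaultdict(set)
models = {"mug": [1.1, 2.4, 3.0], "stapler": [0.4, 2.6]}  # toy library
for name, feats in models.items():
    for m in feats:
        index[quantize(m)].add(name)

def candidates(image_features):
    """Vote for models whose features share hash bins with the image,
    and return them ranked by vote count."""
    votes = defaultdict(int)
    for m in image_features:
        for name in index[quantize(m)]:
            votes[name] += 1
    return sorted(votes, key=votes.get, reverse=True)

# Features near the mug's measurements rank the mug first.
assert candidates([1.1, 2.4, 0.9])[0] == "mug"
```

Indexing of this kind only proposes likely models and starting points; the matching and verification stages below remove the spurious hypotheses.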
2.2.3 Matching
The goal is to find a match between image features and model features; that match must correspond to
some possible view of the modeled object and be, in some sense, optimal. How one measures match quality determines not only what matches will be preferred, but also what match algorithm can be used. The fastest matching algorithm reported so far, Breuel's RAST method (1992), is analogous to a Hough transform in that it searches a transformation space. RAST uses a relatively simple measure that assumes a rigid object, requires corresponding features to fall within some fixed range of each other, and weights each feature
correspondence equally. However, it may be difficult to extend RAST to handle articulate or flexible
objects, or models whose features vary in importance.
Other methods search the space of possible pairings between model and image features. These correspondence space methods (e.g., Grimson 1990) can easily incorporate a variety of matching constraints and more powerful match quality measures, but they are time-consuming: they require time exponential in the number of model or image features. Correspondence space searches can be shortened by using a branch-and-bound technique to prune the search tree, or by accepting near-optimal solutions. Alignment methods do the latter and have proven to be quite effective (Ayache and Faugeras 1986; Lowe 1987a, b; Ullman 1989). They match the first few features with a correspondence space search, use those matches to estimate the pose of the object, and then locate additional matches within the constraints provided by the pose estimate.
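A toy version of the alignment idea, assuming a pure 2D translation in place of the full pose recovered by the cited methods: a few seed correspondences fix the transformation, and further matches are then sought only within a tolerance of the predicted positions. All names and the tolerance value are illustrative.

```python
def estimate_translation(pairs):
    """Least-squares 2D translation from (model point, image point) pairs.
    A real alignment method would also recover rotation and scale."""
    n = len(pairs)
    dx = sum(ix - mx for (mx, my), (ix, iy) in pairs) / n
    dy = sum(iy - my for (mx, my), (ix, iy) in pairs) / n
    return dx, dy

def extend_matches(model_pts, image_pts, pose, tol=1.0):
    """Locate additional matches consistent with the estimated pose:
    each model point predicts an image position, and only image points
    within `tol` of the prediction are accepted."""
    dx, dy = pose
    matches = []
    for mx, my in model_pts:
        px, py = mx + dx, my + dy
        for ix, iy in image_pts:
            if abs(ix - px) <= tol and abs(iy - py) <= tol:
                matches.append(((mx, my), (ix, iy)))
                break
    return matches

seed = [((0, 0), (5, 5)), ((2, 0), (7, 5))]    # first few correspondences
pose = estimate_translation(seed)
more = extend_matches([(0, 2), (9, 9)], [(5.2, 7.1), (40, 40)], pose)
assert len(more) == 1   # only the point near its predicted position matches
```

The pose estimate thus converts an exponential correspondence search into a cheap local test for each remaining feature.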
For an efficient correspondence space search, some attention must be given to the order in which features
are matched. At each stage one should choose a feature that is likely to be found in the image when the
object is present, is unlikely to arise by chance, is relatively unique among model features, is well localized by previous matches, and can effectively constrain the remaining matches. Some have described methods of quantifying these factors so that features are considered in an optimal order (e.g., Kuno, Okamoto, and Okada 1991). Another method of ordering the search is to structure it hierarchically, first recognizing
generic object parts, and then recognizing objects in terms of those parts (Burns and Riseman 1992;Dickinson, Pentland, and Rosenfeld 1992).
2.2.4 Verifying
Once a match has been identified, it must be decided whether that match actually represents an instance of
an object in the image. Often this is done by projecting all model edges into the image and searching for image edges near them; parallel image edges are counted as positive evidence, and crossing ones, as negative evidence (e.g., Huttenlocher and Ullman 1990). More formally, one can use an analytic model of feature occurrence to estimate the probability that a particular match is accidental (i.e., due to a conspiracy of accidental features), and accept the match only if this probability is tiny (e.g., Sarachik and Grimson 1993). Breuel (1993) has suggested that a model of scene structure be used in evaluating the match to judge whether missing features can be explained away by a plausible series of occlusions.
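The accidental-match test can be illustrated with a simplified binomial model, a stand-in for the analytic feature-occurrence models cited above: if each of n model features could be matched by chance with probability p, the tail probability of k or more chance matches indicates whether the match demands a non-accidental explanation.

```python
from math import comb

def p_accidental(n_features, k_matched, p_chance):
    """Probability that at least k of n model features are matched purely
    by chance, each independently with probability p_chance. This
    independence assumption is a simplification for illustration."""
    return sum(comb(n_features, j) * p_chance**j * (1 - p_chance)**(n_features - j)
               for j in range(k_matched, n_features + 1))

# Accept the match only if an accidental explanation is very unlikely.
assert p_accidental(20, 15, 0.1) < 1e-6   # 15 of 20 matched: accept
assert p_accidental(20, 3, 0.1) > 0.1     # 3 of 20 matched: too plausible by chance
```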
2.3 Learning to recognize objects
There are several possible roles for learning in an object recognition system. These are surveyed in the following sections. A final section summarizes efforts to establish theoretical bounds on the learnability of object recognition.
2.3.1 Learning an appearance representation
An object recognition system may learn new types of shape or appearance primitives to use in describing objects. Typically this is done by clustering existing primitives or groups of them, and then associating new primitives with the clusters that have been found. New primitives thus represent particular configurations or abstractions of existing ones. The new configurations may improve the representation's descriptive power, and the new abstractions may allow more appropriate generalizations.
Segen (1989) has demonstrated this approach with a system that learns a representation for 2D contours. The system's lowest-level primitives are distinguished points, such as curvature extrema, found on contours
in training images. Nearby points are paired, each pair is characterized by a vector of measurements, the
measurement vectors are clustered, and a new primitive is invented for each cluster of significant size.
Consequently, each new primitive describes a commonly observed configuration of two distinguished
points. The induction process is repeated with these new primitives to generate higher-level primitives
describing groups of four, eight, and more distinguished points. One would expect this method to be sensitive to clutter in the training images and to parameters of the clustering algorithm; unfortunately, Segen has not reported on that sensitivity.
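A rough sketch of the pairing-and-clustering step attributed to Segen, with invented measurements (pair distance and relative orientation) and a greedy leader-clustering rule standing in for whatever clustering algorithm the original system used.

```python
import math

def pair_measurements(points):
    """Pair each distinguished point (x, y, orientation) with its nearest
    neighbour and describe the pair by a measurement vector: distance
    and relative orientation. These two measurements are assumptions."""
    vecs = []
    for i, (x1, y1, o1) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        x2, y2, o2 = min(others, key=lambda p: (p[0]-x1)**2 + (p[1]-y1)**2)
        vecs.append((math.hypot(x2 - x1, y2 - y1), (o2 - o1) % (2 * math.pi)))
    return vecs

def cluster(vecs, radius=0.5):
    """Greedy leader clustering: each sufficiently novel measurement
    vector starts a cluster, i.e., a candidate new primitive."""
    centres = []
    for v in vecs:
        for c in centres:
            if all(abs(a - b) <= radius for a, b in zip(v, c)):
                break
        else:
            centres.append(v)
    return centres

# Two corners with the same local geometry at different image locations.
pts = [(0, 0, 0.0), (1, 0, 1.5), (10, 10, 0.0), (11, 10, 1.5)]
centres = cluster(pair_measurements(pts))
# Each corner yields the same two pairings (one per direction), so the
# four measurement vectors collapse into two clusters.
assert len(centres) == 2
```

Repeating the induction on the new primitives would yield the higher-level groups of four, eight, and more points described above.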
Two other examples of representation learning will round out this portion of the survey. Delanoy, Verly,
and Dudgeon (1992) and Fichera et al. (1992) have described object recognition systems that induce fuzzy predicates from training images. Their systems represent models as logical formulae, and, therefore, the systems need appropriate predicates. These are invented by clustering measurements of low-level primitives that have been recovered from training images. Turk and Pentland (1991) and Murase and Nayar (1993) have induced features using principal component analysis. They compute the several most significant eigenvectors, or principal components, of the set of training images. Since these few eigenvectors span
much of the subspace containing the training images, they can be used to concisely describe those images
and others like them.
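The eigenvector computation can be illustrated in miniature. The sketch below finds the leading principal component of a set of tiny "images" (flat intensity vectors) by power iteration, then projects an image onto it to obtain a concise description; pure-Python arithmetic stands in for the eigen-decomposition used in the cited work, and the toy data are invented.

```python
def principal_component(images, iters=100):
    """Leading eigenvector of the training images' covariance, found by
    power iteration. Images are flat lists of intensities; the explicit
    covariance matrix is adequate only for the tiny dimension used here."""
    n, d = len(images), len(images[0])
    mean = [sum(img[j] for img in images) / n for j in range(d)]
    centred = [[img[j] - mean[j] for j in range(d)] for img in images]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def project(image, mean, v):
    """One-number description of an image along the principal component."""
    return sum((image[j] - mean[j]) * v[j] for j in range(len(v)))

# Toy "images" whose intensity varies only along the first pixel.
imgs = [[0, 0, 1], [2, 0, 1], [4, 0, 1], [6, 0, 1]]
mean, v = principal_component(imgs)
assert abs(abs(v[0]) - 1.0) < 1e-6   # the component captures that variation
```

The projections onto the few leading components are the concise descriptions referred to above; new images are described, and compared, in that low-dimensional subspace.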
2.3.2 Learning an appearance classifier
The object recognition task can be characterized, in part, as a classification problem: instances represented as vectors, sets, or structures must be classified into categories corresponding to various objects. A wealth of techniques has been developed for classifying, and for inducing classifiers from training examples. For the purposes of object recognition, the important considerations distinguishing these techniques include the expressiveness and complexity of the input representation (e.g., vectors are easier to classify than structures), the generality of the categories learned, the ability to cope with noisy features, the number of training examples needed, and the sensitivity to the order in which examples are presented.
Jain and Hoffman (1988) describe a system that learns rules for classifying objects in range images. The instances classified by their system are sets of shape primitives with associated measurements. The classifier applies a series of rules, each contributing evidence for or against various classifications. Each rule
applies to a particular type of shape primitive and a particular range of measurements for that primitive.
These rules are learned from training images by extracting primitives from the images, clustering them
according to their measurements, and associating rules with the clusters that derive primarily from a single
object. Because this system does not learn constraints governing the relative positions of the shape primitives, it appears to have a very limited ability to distinguish among objects that have different arrangements of similar features.
Neural networks, including radial basis function networks, have been used as trainable classifiers for object recognition (e.g., Brunelli and Poggio 1991). In this role, the network approximates a function that maps a vector of feature measurements to an object identifier, or to a vector of graded yes/no responses, one per object. For this approach to succeed the function must be smooth; furthermore, as with any classifier, the object categories must be shaped appropriately in feature space (e.g., Gaussian radial basis functions are best suited for representing hyperellipsoids). Nearest neighbour classifiers, which make fewer assumptions, are also commonly used. Comparative evaluations of how various classifier/feature space combinations perform on object recognition problems are rare (one example is Brunelli and Poggio 1993).
2.3.3 Learning a structural model
A structural model explicitly represents both shape primitives and their spatial relationships. Its structure is thus analogous to that of the modeled object, and it is often represented as a graph or, equivalently, as a series of predicates. In general, a structural model is learned from training images by first obtaining structural descriptions from each image, and then inducing a generalization covering those descriptions.
Connell and Brady (1987) have described a system that learns structural models for recognizing 2D objects in intensity images. The system incorporates many interesting ideas. They use graphs to represent the part/whole and adjacency relations among object regions that are described by smoothed local symmetries (ribbon shapes). An attribute of a region, such as its elongation or curvature, is encoded symbolically by the presence or absence of additional graph nodes according to a Gray code. A structural learning procedure forms a model graph from multiple example graphs, most commonly by deleting any nodes not shared by all graphs (the well-known dropping rule for generalization). Similarity between two graphs is measured by a purely syntactic measure: simply by counting the nodes they share. Consequently, this system accords
equal importance to all features, and it uses a somewhat arbitrary metric for comparing attribute values.
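The dropping rule and the syntactic similarity measure are simple enough to state directly. The sketch below models each example graph as a bare set of node labels, ignoring arcs and the Gray-coded attributes for brevity; the labels are invented.

```python
def generalize(graphs):
    """The 'dropping rule': retain only the nodes shared by every
    example graph. Each graph is modeled here as a set of node labels."""
    common = set(graphs[0])
    for g in graphs[1:]:
        common &= set(g)
    return common

def similarity(g1, g2):
    """Purely syntactic similarity: count the nodes two graphs share."""
    return len(set(g1) & set(g2))

examples = [{"ribbon:straight", "elongated", "curved-end"},
            {"ribbon:straight", "elongated", "square-end"}]
model = generalize(examples)
assert model == {"ribbon:straight", "elongated"}   # unshared nodes dropped
assert similarity(examples[0], examples[1]) == 2
```

The limitations noted above are visible even in this sketch: every shared node counts equally, and a node differing in any attribute encoding is simply dropped rather than generalized.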
The approaches described below involve more powerful models that represent probability distributions over graphs. A Markov or independence condition is assumed so that the high-order, joint probability distribution can be approximated by a product of low-order distributions, one per node or arc. The probability of a particular graph instance is then defined according to a partial match between that instance and the model.
Wong and You (1985) represent a model as a random graph in which nodes represent shape primitives, arcs and hyperarcs represent relations among them, and both have attribute values characterized by discrete probability distributions. An attributed graph (i.e., a random graph's outcome) is treated as just a special
case of random graph. An entropy measure is defined on random graphs, and the distance between two random graphs is defined as the increment in entropy that would result from merging the two (the minimal increment over all possible mergings). A random graph is synthesized from examples by repeatedly merging the two nearest graphs. This learning method seems to have been demonstrated only with 2D recognition problems and clean, synthetic images. However, McArthur (1991) has extended the random graph formalism to allow continuous attributes, and he has used it to learn 3D, object-centred models from images with known viewpoints.
In Segen's system (1989) for recognizing 2D shapes, a graph node represents a relation among shape primitives, which are oriented point features (see section 2.3.1). But instead of specifying a particular relation, a node in the model graph specifies a probability distribution over possible relations. For example, relations involving two features may include one that holds when the features are oriented away from each other, and another that holds when they are oriented in the same direction; a model node may specify that the first relation holds with probability 0.5, and the second, with probability 0.3. When a graph instance is matched with the model, the instance's probability can be computed by assessing the probability of each instance node according to the distribution of its corresponding model node. Recognition involves finding the most probable match. A model is learned incrementally from instances by matching it with each one successively; following each match the probabilities recorded in matching model nodes are adjusted and any unmatched instance nodes are added to the model.
2.3.4 Learning a set of characteristic views
In learning a model that is to be represented as a set of characteristic views, part of the task is to choose those views. One can cluster the training images (a form of unsupervised learning) to create one characteristic view from each cluster. Gros (1993) and Seibert and Waxman (1992) have done this with real images, while several others have clustered images rendered from CAD models. Gros measures the similarity of an image pair as the proportion of matching shape primitives. Clusters are formed by successively merging nearby images, and then a view-specific model is created from each cluster. Seibert and Waxman use a comparable method; since they are interested in recognizing aircraft in flight, their system also learns a Markov model characterizing transitions between characteristic views.
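The merging scheme described above can be sketched in a few lines. The sketch below is a generic agglomerative procedure, not Gros's or Seibert and Waxman's actual code: the function name, the threshold, and the cluster representation are assumptions, and the caller supplies the similarity function (e.g., the proportion of matching shape primitives between clusters).

```python
def cluster_views(images, similarity, threshold=0.5):
    """Hypothetical sketch: greedily merge the most similar pair of
    clusters until no pair's similarity exceeds the threshold.
    'similarity' scores two clusters (each a list of images)."""
    clusters = [[img] for img in images]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        # examine every pair of clusters, keeping the most similar
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if s > best:
                    best, best_pair = s, (i, j)
        if best_pair is None:     # no pair clears the threshold
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))   # merge j into i
    return clusters
```

Each surviving cluster then yields one characteristic view. Representing an image by its set of detected primitives, a Jaccard-style similarity reproduces the "proportion of matching primitives" idea.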
2.3.5 Learning a recognition strategy
Draper (1993) has considered how a system equipped with a variety of special-purpose representations and algorithms might learn strategies for employing those techniques to recognize specific objects. A typical recognition task would be to locate a tree by fitting a parabola to the top of its crown.1 For this task, an appropriate strategy is to segment the image, extract regions that are colored and textured like foliage, group these into larger regions, smooth region boundaries, and fit parabolas to the boundaries. A human supplies training examples by pointing out the desired parabolas in a series of images. The system then evaluates various strategy choices (e.g., whether to smooth region boundaries) and parameter choices (e.g., the degree of smoothing) for cost and effectiveness. From these evaluations it constructs a strategy, including alternatives, for performing the task on new images. By learning in what order to try alternative recognition methods, Draper's system differs from those that just select a single set of features useful for recognition.
2.3.6 Theoretical learnability
There have been some efforts to identify theoretical limits on what can be learned from images. These efforts are based on the analogy that learning to recognize an object from images is like learning a concept from examples. Valiant (1984) has provided a useful definition for characterizing the class of concepts that can be learned by a particular algorithm: informally, a class is probably approximately correct (PAC) learnable by an algorithm if, with high probability, the algorithm learns a concept that correctly classifies a high proportion of examples using polynomially bounded resources (and, consequently, a polynomially bounded number of training examples); the bound is a polynomial in both the accuracy and some natural parameter of the concept class (e.g., vector length for concepts defined on a vector space). Significantly, the algorithm has no prior knowledge about the distribution of examples.
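The informal statement above corresponds to the standard formulation of PAC learnability (paraphrased from the literature, not quoted from the proposal). For every target concept c in class C and every distribution D over instances, given a sample S of m labeled examples drawn independently from D, the learner must output a hypothesis h_S satisfying

\[
\Pr_{S \sim D^m}\!\Bigl[\, \Pr_{x \sim D}\bigl[h_S(x) \neq c(x)\bigr] \le \varepsilon \,\Bigr] \;\ge\; 1 - \delta,
\qquad
m \le p\!\left(\tfrac{1}{\varepsilon}, \tfrac{1}{\delta}, n\right),
\]

for all accuracy and confidence parameters \(0 < \varepsilon, \delta < 1\), where p is a fixed polynomial and n is the natural size parameter of the concept class; the learner's running time must likewise be polynomially bounded.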
Shvayster (1990) has shown that some classes of concepts defined on binary images are not PAC-learnable from positive examples by any algorithm. For example, suppose a template is said to match an image when every black pixel in the template corresponds to a black pixel in the image (though not necessarily vice versa). Then the concept consisting of all instances not matching some unknown template is not PAC-learnable. Shvayster speculates that some nonlearnable concepts may become learnable, however, if some prior knowledge about the distribution of examples is available.

1 This example is from Draper 1993, pp. 93–119.
Edelman (1993) has argued that Shvayster's negative result is not applicable to object recognition because it uses an instance representation, the binary image, that is inappropriate. If instead instances are represented by vectors of point feature locations, Edelman shows, then recognition of an object can be learned from a polynomial number of positive examples. He concludes that model learning may be practical, provided an appropriate representation is chosen.
3. Approach
This chapter describes an approach to the design of a system that learns to recognize objects. The discussion is organized around four central issues, with one section devoted to each. The four issues are:
How is an image represented?
How is a model represented?
How is recognition accomplished?
How are models learned from images?
A fifth issue, how the database of models is organized and accessed, must also be considered in designing a
complete object recognition system that is meant to work with a large repertoire of objects. To limit its scope, however, the proposed research will not study this issue in any depth. Instead, the simplest possible
organization and access methods will be used for the database of models: a list of models will be kept, and,
during recognition, each model will be tested in sequence. In future, this aspect of the system could be
improved readily using existing techniques for managing model databases. For example, efficient access to
a large database could be maintained by constructing hash table indices, factoring out subparts common to
multiple models, and clustering similar models.
The proposed approach to recognition and learning may be summarized as follows: A model is organized
as a set of characteristic views, each representing a probability distribution over collections of local appearance features. An image is represented using similar appearance features. Recognizing an object in an image requires a search for the most probable correspondence between model features and image features. When learning a model, the system uses a clustering algorithm to divide training images among characteristic views; within each view cluster, training images are consolidated into a single description by matching
and generalizing them. Features found to correspond in training images provide the samples from which the
probability distributions are estimated.
Some aspects of this approach have already been implemented and tested, as reported in chapter 4. The
approach described here incorporates lessons learned from that experience.
3.1 Image representation
This section begins with an explanation of the general concepts underlying the proposed image representation. This is followed by more specific information on the representation to be implemented, and the method by which descriptions in that representation are derived from images.
3.1.1 Concepts
The system represents an image in terms of specific properties, which we call features.2 Each feature has a particular type and location within the image. It may, for example, be a segment of intensity edge, a particular arrangement of such segments, a region of uniform texture, or one of uniform colour. Some low-level
lar arrangement of such segments, a region of uniform texture, or one of uniform colour. Some low-level
features may be defined as being the responses of feature detectors, such as edge or corner detectors; other,
higher-level features may be defined as certain groupings or abstractions of the low-level ones. A typical
image will be described by numerous features of various types.
The design of features should not be critical to the performance of the system; what is important is that
there be a wide range of features able to provide a rich, somewhat redundant description of any image.
Good features are those that can be detected reliably, and are relatively invariant with respect to unimportant changes (for our purposes, modest changes in lighting or viewpoint). For efficiency in recognition it is useful to have some highly selective features that usually occur only in the presence of certain objects. For example, in recognizing manufactured objects, features denoting various geometric arrangements of intensity edges may serve this role well. Some more commonplace features, such as simple edge fragments, should supplement the highly selective ones so that the overall repertoire can still describe a wide variety of objects, at least at a basic level. Of course, distinctions can only be made if the repertoire includes features that respond to those distinctions.
A token denotes a feature that has been identified in an image. It records the type of feature, its position within the image, and a vector of numeric attributes that help to characterize the feature. An L junction feature, for example, may have one attribute for the junction's angle and another for the relative lengths of
2 Throughout this chapter, the proposed system is described using the present tense rather than the future tense. This is for readability, and to allow the material to be re-used readily in other texts, such as a thesis.
its two segments. Attributes are expressed so that they are invariant with respect to translation, rotation, and scaling of the feature within the image, using, for example, scale-normalized measures (Saund 1992).
A feature's position is represented by a point location in the image plane, an orientation, and a scale, all measured according to some convention appropriate for that type of feature. For example, an L junction's location may be recorded as the location of its vertex, its orientation may be that of its bisector, and its scale may be the diameter of the smallest circle enclosing the segments forming the junction. A feature whose orientation is ambiguous requires special treatment. If there are just a few different ways of assigning an orientation to the feature then it is represented by multiple tokens, one per orientation. A straight line segment, for example, is represented by two tokens with opposite orientations. (This duplication simplifies token operations such as comparison and grouping.) If, like a circle, the feature has no well-defined orientation, its token's orientation is left undefined.
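As a concrete illustration of these conventions, the sketch below builds a token for a line/line (L) junction from its vertex and two arm endpoints. The function name and dictionary layout are hypothetical, and the enclosing circle is approximated by one centred at the vertex; only the conventions themselves (vertex location, bisector orientation, enclosing-circle scale, invariant attributes) come from the text.

```python
import math

def l_junction_token(vertex, arm1_end, arm2_end):
    """Illustrative token for a line/line junction. Location = vertex,
    orientation = bisector of the two arms, scale = diameter of a circle
    (centred at the vertex, for simplicity) enclosing both segments."""
    p = vertex
    a1 = math.atan2(arm1_end[1] - p[1], arm1_end[0] - p[0])  # arm 1 direction
    a2 = math.atan2(arm2_end[1] - p[1], arm2_end[0] - p[0])  # arm 2 direction
    # corner angle: attribute invariant to translation, rotation, and scale
    angle = abs((a2 - a1 + math.pi) % (2 * math.pi) - math.pi)
    # bisector orientation, computed via unit-vector sum to avoid wraparound
    bisector = math.atan2(math.sin(a1) + math.sin(a2),
                          math.cos(a1) + math.cos(a2))
    len1 = math.hypot(arm1_end[0] - p[0], arm1_end[1] - p[1])
    len2 = math.hypot(arm2_end[0] - p[0], arm2_end[1] - p[1])
    scale = 2 * max(len1, len2)          # diameter of vertex-centred circle
    rel_length = min(len1, len2) / max(len1, len2)  # second invariant attribute
    return {"type": "line/line junction",
            "location": p,
            "orientation": bisector,
            "scale": scale,
            "attributes": [angle, rel_length]}
```

For a right-angle junction with equal unit arms, the token's corner angle is π/2, its orientation is the diagonal bisector, and its scale is 2.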
Tokens denoting the features found in an image are organized in a graph structure, which we call an image graph. Graph nodes represent tokens, and directed arcs represent composition and abstraction, as when one feature is found by grouping or abstracting others.
Formally, an image graph G is denoted by a tuple ⟨F, R⟩, where F is a set of image feature tokens and R is a relation over elements of F. An image feature token f_k ∈ F is a tuple ⟨t_k, a_k, b_k⟩, where t_k is the feature's type, a_k is its attribute vector, and b_k is its position. The domain of a feature's attribute vector depends on the feature's type. Section 3.3, below, describes how positions such as b_k are represented. Finally, a relation in R is a tuple ⟨k, l_1, …, l_m⟩, indicating that image feature k was found by grouping or abstracting image features l_1 through l_m.
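The tuple definitions above map naturally onto a small data structure. The following sketch (hypothetical names, using Python dataclasses) stores tokens ⟨t_k, a_k, b_k⟩ and grouping relations ⟨k, l_1, …, l_m⟩ by token index:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureToken:
    """One image feature token f_k = <t_k, a_k, b_k>."""
    ftype: str        # t_k: feature type, e.g. "straight line segment"
    attributes: list  # a_k: invariant numeric attributes
    position: tuple   # b_k: (x, y, orientation, scale), per section 3.3

@dataclass
class ImageGraph:
    """An image graph G = <F, R>."""
    tokens: list = field(default_factory=list)     # F, indexed by position
    relations: list = field(default_factory=list)  # R: tuples (k, l_1, ..., l_m)

    def add_token(self, token, parts=()):
        """Add a token; 'parts' are the indices of the lower-level tokens
        that were grouped or abstracted to form it (empty for primitives)."""
        self.tokens.append(token)
        k = len(self.tokens) - 1
        if parts:
            self.relations.append((k, *parts))
        return k
```

Note that lower-level tokens stay in the graph after being grouped, matching the redundancy argument made below.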
It is important to note that the image graph deliberately retains information that may be redundant. Some
researchers, having other purposes in mind, use procedures that retain only the best of competing descriptions, or that discard tokens once they have been grouped to form a new token. Here we retain all tokens
because doing so makes the overall representation both more expressive and more stable. Consider, for
example, an edge segment in one image that yields both a line token and an arc token. Together the two
tokens provide more information about the segment than either does alone. When a second image is
acquired from a slightly different viewpoint the same segment may be perceived as being slightly more
curved, and it no longer yields the line. A third image may produce the line, but not the arc. But since the
first image graph included both the line and arc tokens it differs only partially from either the second or
third graph. As this example illustrates, we achieve a more stable representation by permitting some redundancy.
3.1.2 Implementation
For a first implementation of the system we have chosen to use features based solely on intensity edges,
which provide some degree of illumination and viewpoint invariance. Table 3.1 summarizes these features
and their attributes.
Table 3.1. Types of features used to describe image content.
Feature Type                        Attributes                               Position (a)
----------------------------------  ---------------------------------------  ------------
edgel                               (none)                                   LO (b)
straight line segment               (none)                                   LO (b)
elliptical arc segment              angular extent; eccentricity             LOS
line/line junction                  corner angle                             LO
arc/arc junction                    corner angle; ratio of arc radii         LOS
line/arc junction                   corner angle; scale-normalized           LOS
                                    arc radius
junction pair with common line      sum of corner angles; difference         LOS
                                    of corner angles
junction triple with common lines   angle of central corner; ratio of        LOS
                                    two interior sides
parallel lines                      angle between lines                      LOS
parallel arcs                       angle between arcs; angle spanned        LOS
                                    by arcs; scale-normalized average
                                    separation
closed convex region                compactness; eccentricity                LO(c)S
ribbon                              scale-normalized average width           LOS

(a) Which feature position components are well-defined. L: location; O: orientation; S: scale.
(b) Orientation has a two-way ambiguity.
(c) Orientation is undefined if eccentricity is low.
The most basic features are edgels: short, fixed-length strands of connected edge pixels that have been identified by an edge detection operator and thinned to single-pixel width. Connected chains of edgels are segmented into straight lines and elliptical arcs.3 The segments are then grouped successively, and in various ways, to form higher-level features. This initial repertoire of features and feature attributes could be expanded, as needed, to allow richer descriptions of appearance.
The system uses a straightforward grouping method that is entirely bottom-up (i.e., data-driven) and controlled by numerous thresholds and other parameters. Table 3.1 indicates the order in which features are identified. Junctions, for example, are found by exhaustively testing all pairs of segments, measuring each pair's geometry against several thresholds. Regions and ribbons are found by identifying in the image graph
particular configurations of low-level features, such as junctions.
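The exhaustive pairwise test for junctions can be sketched as follows. This is an illustrative reconstruction, not the proposal's code: the function names and the two thresholds (endpoint gap, minimum corner angle) are assumptions, and a segment is simplified to a pair of endpoints.

```python
import math

def _arm_direction(seg, end):
    """Orientation of a segment's arm leaving endpoint 'end'."""
    other = seg[1] if end == seg[0] else seg[0]
    return math.atan2(other[1] - end[1], other[0] - end[0])

def find_line_junctions(segments, gap_thresh=5.0, angle_min=math.radians(20)):
    """Hypothesize a line/line junction wherever endpoints of two straight
    segments nearly coincide and the corner angle clears a threshold.
    Each segment is a pair of (x, y) endpoints; thresholds are illustrative."""
    junctions = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):       # all pairs of segments
            for p in segments[i]:                   # each endpoint of segment i
                for q in segments[j]:               # each endpoint of segment j
                    if math.dist(p, q) > gap_thresh:
                        continue                    # endpoints too far apart
                    a1 = _arm_direction(segments[i], p)
                    a2 = _arm_direction(segments[j], q)
                    corner = abs((a2 - a1 + math.pi) % (2 * math.pi) - math.pi)
                    if corner >= angle_min:
                        vertex = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
                        junctions.append((vertex, corner))
    return junctions
```

Testing all O(n²) pairs is consistent with the "exhaustive" bottom-up description; an indexing structure over endpoints would be the obvious refinement.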
Figure 3.1 shows the features found in one image. As this example demonstrates, the method is far from
perfect. Compared to human vision it seems to miss some features and hallucinate others. However, any
grouping method is bound to have these faults to some degree. It remains the task of the matching and
learning procedures to accommodate any uncertainty inherent in the grouping process. Although more
sophisticated grouping methods have been developed recently (e.g., Sarkar and Boyer 1993), this simple method, with some minor improvements, appears adequate for the experiments proposed here.
3.2 Model representation
3.2.1 Use of characteristic views
To fully and accurately describe the range of possible variation in an object's appearance, a model is organized on two levels. Significant variations in appearance are handled by subdividing the model into a set of characteristic views; because these are independent, each can describe an entirely different aspect of the object. Within the characteristic views, smaller variations can be represented since each characteristic view defines a probability distribution over a variety of similar image graphs.
3 This repertoire of straight lines and elliptical arcs may prove inadequate for representing and recognizing a useful variety of test objects. In that case, the repertoire should be extended by one or two additional types of parameterized curve. Leonardis (1993) shows how to recover straight, parabolic, and elliptical arcs from edge data.
Figure 3.1. An illustration of the features used to describe an image. The 365 × 512 pixel image (a) yielded 280 edge segments (b), 200 circular arcs and straight lines (c), 248 junctions (d), 409 junction pairs (e), 75 parallel line and arc pairs (f), 21 regions (g), and 14 ribbons (h).
This is a viewer-centred model, and it is preferred over an object-centred one for several reasons. Most importantly, the viewer-centred model is more easily learned. To learn an object-centred model, one has to recover the 3D location of each feature. That is difficult because the viewpoint of each training image is unknown, there is uncertainty in the measurement of each feature's image location, and there may be errors in matching features among images. To learn a viewer-centred model it is not necessary to recover 3D structure, and difficulties like unknown viewpoints or appreciable measurement uncertainty are more easily managed.
Each characteristic view describes a range of possible appearances by defining a joint probability distribution over image graphs. Because the space of image graphs is enormous, however, it is not practical to represent or learn this distribution in its most general form. Instead, the joint distribution is approximated: the various features making up the characteristic view are treated as independent so that the joint distribution can be decomposed into a product of marginal distributions, one per feature.4 As we shall see, this approximation greatly simplifies the representation, matching, and learning of models.
One consequence of the simplifying approximation is that statistical dependence (association or covariance) among model features cannot be accurately represented in a single characteristic view. We might consider, as an extreme example of such dependence, an object for which two subsets of features are mutually exclusive: only one of the two subsets appears in any image. This association among features cannot be represented by a single characteristic view, with its independent features; it can, however, be represented perfectly by two characteristic views, each containing one of the two subsets.
In fact, by using a large enough set of characteristic views we can model any object as accurately as we might wish. However, for the sake of efficiency we would prefer to use relatively few views and let each represent a moderate range of possible appearances. Part of the challenge of learning a model, then, is to strike an appropriate balance between the number of characteristic views used and the accuracy of those views over their ranges. This issue will be revisited when we discuss the model learning procedure in section 3.5.
4 Similar feature independence assumptions are often used to simplify other computer vision problems.
3.2.2 Representation of a characteristic view
A single characteristic view is described by a model graph (figure 3.2). Like an image graph, a model graph has nodes that represent features and arcs that represent composition and abstraction relations among features. Each node records the information needed to estimate three probabilities:

The probability of observing this feature. Or, more precisely, the probability of observing it in an image depicting the characteristic view of the object. The node records the number of times that its feature has been identified in training images, and a count is also kept of all training images used to learn the overall model graph. The probability is estimated from these two numbers.
Given that this feature is observed, the probability of it having a particular set of attribute values. This is characterized by a probability density function since attribute values are continuous. Prior to learning, little can be assumed about the form of this density function for any particular feature; it may depend on many factors, such as possible deformations of the object, variations in lighting, and effects that uncertainty in measuring low-level features may have on the attributes of higher-level ones. Consequently, the system uses a non-parametric density estimator that makes relatively few assumptions. The estimator, which is described in section 3.3.2, requires samples drawn at random according to the distribution to be estimated. These samples are attribute vectors that have been acquired from training images and are recorded by the model graph node. As samples from the unknown distribution accumulate, the learning procedure may discover that the distribution can be approximated well by a simple, parametric form (e.g., a multinormal distribution). In that case, an optimization is performed: the sequence of attribute vectors is replaced by a handful of distribution parameters from which probability density estimates can be computed more quickly.
Given that this feature is observed, the probability of it having a particular position. This is also characterized by a probability density function and estimated from a sequence of samples obtained from training images. The positions of all features in a single model graph are represented using a common, arbitrarily-fixed coordinate system. Each feature's position is modeled as though it were independently distributed with respect to that coordinate system. We approximate this distribution using a normal distribution, with mean and covariance parameters estimated from the sequence of position samples.
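The normal position model above reduces to standard sample statistics. A minimal sketch (the function name is assumed; the [x, y, u, v] position encoding is the one defined in section 3.3):

```python
import numpy as np

def fit_position_model(samples):
    """Fit the normal position model for one feature: mean vector and
    covariance matrix estimated from its sequence B_j of [x, y, u, v]
    position samples, one per training image."""
    B = np.asarray(samples, dtype=float)
    mean = B.mean(axis=0)
    cov = np.cov(B, rowvar=False)   # unbiased sample covariance, 4x4
    return mean, cov
```

With few samples the covariance estimate is singular or unstable, so an implementation would likely regularize it (e.g., add a small diagonal term) before inverting it in the density computations of section 3.4.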
Figure 3.2. Schematic illustration of model representation. An object is modeled by one or more model graphs, each describing a set of features and their relations. A feature is represented by a node recording information supporting probability estimates: the probability of the feature being observed, of it having particular attribute values, and of it having a particular position relative to other features.
Formally, a model graph G′ is denoted by a tuple ⟨F′, R′, m⟩, where F′ is a set of model feature tokens, R′ is a relation over elements of F′, and m is the number of training images used to produce G′. A model feature token f′_j ∈ F′ is a tuple ⟨t′_j, m_j, A_j, B_j⟩, where t′_j is the feature's type, m_j is the number of training images in which the feature was found, and A_j and B_j are the sequences of attribute vectors and positions drawn from those training images. Finally, a relation in R′ is a tuple ⟨j, l_1, …, l_m⟩, indicating that model feature j is a grouping or abstraction of model features l_1 through l_m.
3.3 Coordinate systems
A feature's position is specified by a location, orientation, and scale expressed in terms of some 2D, Cartesian coordinate system. To locate image features, we use an image coordinate system whose frame is the camera's focal plane. To locate model features, we use a single, arbitrarily-fixed model coordinate system for all features within a model graph.
Part of the task of matching a model with an image is to determine a viewpoint transformation that brings
the image and model features into close correspondence. In our case, this viewpoint transformation is a 2D
similarity transformation from image coordinates to model coordinates.
By choosing carefully the manner in which positions and transformations are represented we can ensure
that the transformation may be expressed as a linear operation, thus simplifying the matching process. One
formulation that achieves this represents the transformation as a vector. (This formulation has been used by Ayache and Faugeras, 1986, among others.) A transformation, T, that is equivalent to a rotation by θ, a scaling by r, and a translation by [tx ty]ᵀ is represented by the vector t = [tx ty tu tv]ᵀ, where θ = arctan(tv/tu) and r = √(tu² + tv²). Similarly, a feature with orientation φ, scale s, and location [x y]ᵀ is represented by the vector b = [x y u v]ᵀ, where φ = arctan(v/u) and s = √(u² + v²). The transformation of the position vector b, written T(b), is obtained by pre-multiplying t with a matrix derived from b:

    b′ = ⎡ x′ ⎤ = T(b) = ⎡ 1  0  x  −y ⎤ ⎡ tx ⎤
         ⎢ y′ ⎥          ⎢ 0  1  y   x ⎥ ⎢ ty ⎥
         ⎢ u′ ⎥          ⎢ 0  0  u  −v ⎥ ⎢ tu ⎥
         ⎣ v′ ⎦          ⎣ 0  0  v   u ⎦ ⎣ tv ⎦
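The linear formulation is easy to check numerically. A minimal sketch (function name assumed) that builds the matrix from b and applies t:

```python
import numpy as np

def transform_position(b, t):
    """Apply viewpoint transformation t = [tx, ty, tu, tv] to a feature
    position b = [x, y, u, v] using the linear formulation above."""
    x, y, u, v = b
    M = np.array([[1, 0, x, -y],
                  [0, 1, y,  x],
                  [0, 0, u, -v],
                  [0, 0, v,  u]], dtype=float)
    return M @ np.asarray(t, dtype=float)
```

For example, a pure 90° rotation at unit scale is t = [0, 0, cos 90°, sin 90°] = [0, 0, 0, 1]; it carries a feature at (1, 0) with orientation 0 to (0, 1) with orientation 90°, i.e. b = [1, 0, 1, 0] maps to b′ = [0, 1, 0, 1].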
This formulation allows a transformation to be estimated easily from a set of feature correspondences. Given a pair of image and model features with positions b and b′, the transformation aligning the two features can be obtained as the solution to the set of four linear equations given by b′ = T(b). With additional feature pairs the problem of estimating a transformation becomes over-constrained, and a solution that is optimal in the least-squares sense can be found directly, as described in section 3.4.3.
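Stacking the four linear equations contributed by each correspondence yields a standard least-squares problem. A sketch under the same [x, y, u, v] encoding (the function name is assumed, and this is the generic least-squares solution, not necessarily the exact procedure of section 3.4.3):

```python
import numpy as np

def estimate_transformation(pairs):
    """Estimate t = [tx, ty, tu, tv] from (image position b, model position b')
    correspondences by stacking the constraints b' = T(b) and solving in the
    least-squares sense."""
    rows, rhs = [], []
    for b, b_prime in pairs:
        x, y, u, v = b
        rows.extend([[1, 0, x, -y],     # four equations per correspondence
                     [0, 1, y,  x],
                     [0, 0, u, -v],
                     [0, 0, v,  u]])
        rhs.extend(b_prime)
    t, *_ = np.linalg.lstsq(np.asarray(rows, float),
                            np.asarray(rhs, float), rcond=None)
    return t
```

A single correspondence determines t exactly (four equations, four unknowns); each additional pair over-constrains the system and averages out measurement noise.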
3.4 Recognition procedure
To identify an instance of an object in an image the system must find a match between a model graph and an image graph. That match will likely be a partial one, with some image features not explained by the model (perhaps because there are other objects in the scene) and some model features not found in the image (perhaps because of shadows or occlusion). However, the partial match should also be optimal in some sense, maximizing both the number of nodes matched and the quality or closeness of those matches. Those two factors are evaluated by a match quality measure, which is described below in section 3.4.1. The measure relies on estimates of probability density, some of which are supplied by the estimator described in section 3.4.2. In section 3.4.3, the procedure for finding a match between a model graph and an image graph is described. Once found, a match is verified as described in section 3.4.4. To simplify the presentation, model graph nodes and image graph nodes are often referred to simply as model and image features.
3.4.1 Match quality measure
The matching procedure is guided by a match quality measure that considers several factors, including
which features match, how similar their attribute values are, how closely their positions correspond, and
how uncommon the features are. Each factor is weighted according to past matching experience as recorded by the model. The measure combines empirical statistics using the rules of Bayesian probability theory to judge the likelihood that a given pairing of model features with image features corresponds to an actual instance of the object in the image.
A matching between image and model features is described by a sequence of tuples, E = {e_i}. Each tuple is either of the form ⟨j, k⟩, meaning that model feature j matches image feature k, or of the form ⟨j, ∅⟩, meaning that model feature j matches nothing. A complete matching is one that contains at least one tuple for each model feature.
Let H denote the hypothesis that an object represented by model graph G′ is present in some image. We seek a match E and a viewpoint transformation T that maximize P(H | E, T), which can be written

    P(H | E, T) = [ P(E ∧ T | H) / P(E ∧ T) ] P(H)
                = [ P(E | T, H) P(T | H) / P(E ∧ T) ] P(H).                    (3.1)
There is no practical way to represent in full generality the high-dimensional, joint probability density functions P(E | T, H) and P(E ∧ T). Instead, we must approximate them using products of low-dimensional, marginal probability density functions. The approximation allows us to rewrite equation 3.1 as

    P(H | E, T) ≈ [ ∏_i P(e_i | T, H) / P(e_i) ] [ P(T | H) / P(T) ] P(H).     (3.2)
The approximation would be entirely accurate if the following two independence properties were to hold:
(a) { e_i } is collectively independent given knowledge of T and H
(b) { e_i , T } is collectively independent in the absence of any knowledge of H
However, in practice we can expect these independence properties to hold only approximately. Given that an object is present at a particular pose, features detected at widely separated locations on the object may be largely independent, thus preserving property (a). And in a random scene cluttered with unknown objects, even nearby features may be largely independent, thus preserving property (b). The properties fail to hold, however, whenever there is redundancy among features. For example, a collection of nearby features is not independent of some additional feature that represents a noteworthy grouping of them; in such cases, equation 3.2 will overrate the significance of matching all of the features. Nevertheless, with redundancy present throughout a model graph, this bias may have a negligible effect on the outcome of any particular
matching problem. Our research strategy is pragmatic: we have adopted equation 3.2 as a first approximation, and will replace it with a more careful approximation if that proves necessary.
Equation 3.2, then, forms the basis of our match quality measure. We assume that all viewpoint transformations are equally likely a priori, and thus P(T) is a constant. P(H) is the prior probability that an object as depicted by G′ is present in an image; it can be estimated from the proportion of training images that matched G′ and were used to create G′. P(T | H) can be taken as identical to P(T), or it can be estimated from past training images by keeping some record of the transformations obtained when matching those images.
The remaining terms of equation 3.2 are the prior and conditional probabilities of various feature matches.
Before defining those terms, we must introduce notation for some of the specific events within the universe
of matching outcomes.
    e_j = k   is the event that model feature j matches image feature k
    e_j = ∅   is the event that model feature j matches nothing
    a_j = a   is the event that model feature j matches a feature whose attributes are a
    b_j = b   is the event that model feature j matches a feature at position b (in model coordinates)
Conditional probability of a feature match
For the conditional probabilities, P(e_i | T, H), there are two cases to consider.

(a) P(e_j = k | T, H) is the probability of model feature j matching image feature k in an image of G′ with viewpoint transformation T. Our estimate considers how often j has matched during training, and what attributes and positions its matching features have had.

    P(e_j = k | T, H) = P(e_j ≠ ∅ ∧ a_j = a_k ∧ b_j = T(b_k) | T, H)
                      = P(a_j = a_k | b_j = T(b_k), e_j ≠ ∅, T, H)
                        · P(b_j = T(b_k) | e_j ≠ ∅, T, H) · P(e_j ≠ ∅ | T, H).

By adopting some additional independence assumptions (e.g., that b_j gives no information about a_j), we obtain

    P(e_j = k | T, H) = P(a_j = a_k | e_j ≠ ∅, H) · P(b_j = T(b_k) | e_j ≠ ∅, T, H) · P(e_j ≠ ∅ | H).

The P(a_j = a_k | …) and P(b_j = T(b_k) | …) terms are estimated using the sequences of attribute vectors and positions recorded with model feature j. The estimate of P(a_j = a_k | …) uses the non-parametric density estimator described in the next section, while the estimate of P(b_j = T(b_k) | …) assumes a normal distribution:

    P(b_j = b | e_j ≠ ∅, T, H) = G_4(b − y_j, C_j),

where

    G_d(x, C) = exp(−xᵀ C⁻¹ x / 2) / ((2π)^(d/2) (det C)^(1/2))

and y_j and C_j are the mean vector and covariance matrix of B_j, the sequence of positions recorded with model feature j.

Finally, P(e_j ≠ ∅ | H) is computed according to the frequency with which model feature j was matched among training images. We use a Bayesian estimator with a uniform prior:

    P(e_j ≠ ∅ | H) = (m_j + 1) / (m + 2).                                      (3.3)

(Recall that m is the number of training images that have been incorporated into G′, and m_j is the number of those in which model feature j matched some image feature.)
(b) P(e j = | T , H) is the probability of model feature j remaining unmatched in an image of G withviewpoint transformation T. It is just 1- P(e j | T , H), where P(e j | T , H) = P(e j | H)under an independence assumption stated earlier, and P(e j | H) was defined by equation 3.3.
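The factored estimate in case (a) can be made concrete in code. The following Python sketch is illustrative rather than the proposal's implementation: it simplifies the position term to a diagonal covariance, and the attribute-density term is supplied precomputed (in the full method it comes from the non-parametric estimator of section 3.4.2).

```python
import math

def p_match(m_j, m):
    # equation 3.3: Bayesian estimate, with uniform prior, of the probability
    # that model feature j matches something, given m_j matches in m images
    return (m_j + 1) / (m + 2)

def gaussian_density(x, mean, cov_diag):
    # G_d(x - y, C), simplified here to a diagonal covariance matrix
    d = len(x)
    det = 1.0
    quad = 0.0
    for xi, mi, ci in zip(x, mean, cov_diag):
        det *= ci
        quad += (xi - mi) ** 2 / ci
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (d / 2) * math.sqrt(det))

def conditional_match_prob(attr_density, pos, pos_mean, pos_cov_diag, m_j, m):
    # P(e_j = k | T, H): product of the attribute, position, and match terms
    return attr_density * gaussian_density(pos, pos_mean, pos_cov_diag) * p_match(m_j, m)
```

Here `attr_density` stands in for P(a_j = a_k | e_j ≠ ∅, H), and `pos` is the image feature's position mapped into model coordinates by T.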
Prior probability of a feature match
Estimates of the prior probabilities, P(e_i), are based on measurements of a large, random collection of images typical of those in which G will be sought. In this milieu collection, G occurs randomly along with other views and other objects. The milieu is used to estimate the probability of a feature match occurring regardless of whether G is present. By an analysis similar to that underlying estimates of the conditional probabilities, we obtain estimates for the two cases of e_i:

(a) P(e_j = k) = P(a_j = a_k | e_j ≠ ∅) P(b_j = b_k | e_j ≠ ∅) P(e_j ≠ ∅).    (3.4)
(b) P(e_j = ∅) = 1 − P(e_j ≠ ∅).

The P(a_j = ·) term in equation 3.4 is estimated using a non-parametric density estimator and samples of attribute vectors drawn from the milieu collection. The P(b_j = ·) term is estimated by assuming that feature positions are uniformly distributed throughout a bounded region of the model coordinate system. Finally, P(e_j ≠ ∅) is estimated from the frequency with which features of the same type as j occur in the milieu collection.
3.4.2 Probability density estimation
To evaluate the match quality measure we have to estimate the probability that a feature has a particular
attribute vector. To help us make this estimate, we have samples of attribute vectors that have been
acquired by observing the feature in training images. The estimation problem is therefore of the following
form: given a sequence of d-dimensional vectors {x_i : 1 ≤ i ≤ n} drawn at random from an unknown distribution, estimate the probability of a vector x being drawn from the same distribution.

This could be solved by assuming that the distribution is of some parameterized form (e.g., normal) and using the {x_i} to estimate those parameters (e.g., the mean and variance). However, the attribute vector distributions may be complex since they depend not only on sensor noise and measurement errors, but also on systematic variations in object shape, lighting, and pose. Therefore we have chosen a non-parametric estimation method described in detail by Silverman (1986). In its simplest form, this method estimates probability density5 by summing contributions from a series of overlapping kernels:

f(x) = (1 / (n h^d)) Σ_i K((x − x_i) / h),    (3.5)

where h is a constant smoothing factor, and K is a kernel function. We use the Epanechnikov kernel because it has finite support and can be computed quickly; it is defined as

K(x) = (1 / (2 c_d)) (d + 2) (1 − xᵀx)   if xᵀx < 1,
K(x) = 0                                 otherwise,

where c_d is the volume of a d-dimensional sphere of unit radius. The smoothing factor h appearing in equation 3.5 strikes a balance between the smoothness of the estimated distribution and its fidelity to the observations, {x_i}. We adjust h using a locally-adaptive method: with f as a first density estimator, we create a second estimator, f_a, whose smoothing factor varies according to the first estimator's density estimate:

f_a(x) = (1 / (n h^d)) Σ_i λ_i^(−d) K((x − x_i) / (h λ_i)),
5 Because the attribute values are continuous, the estimates are of probability densities rather than masses. However, the problem of integrating these densities to obtain masses is neatly avoided since the match quality measure only requires a ratio of two mass estimates; we can compute this ratio as a ratio of two density estimates.
where λ_i = (f(x_i) / g)^(−1/2) and g = (Π_i f(x_i))^(1/n).
The various λ_i incorporate the first density estimates at the points x_i, and g is a normalizing factor. This
adaptive estimator smoothes more in low-density regions than in high-density ones. Thus a sparse outlier is
thoroughly smoothed while a central peak is accurately represented in the estimate. Figure 3.3 shows an
example of an estimate produced by this estimator.
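To make the estimator concrete, the following Python sketch implements equation 3.5 and the locally-adaptive variant for the one-dimensional case (d = 1, for which c_1 = 2, giving the familiar K(u) = 0.75(1 − u²)). It is a simplified illustration under those assumptions, not the proposal's implementation.

```python
import math

def epanechnikov(u):
    # Epanechnikov kernel for d = 1: K(u) = 0.75 * (1 - u^2) on |u| < 1
    t = u * u
    return 0.75 * (1.0 - t) if t < 1.0 else 0.0

def kde(x, samples, h):
    # equation 3.5: fixed-bandwidth density estimate at x
    n = len(samples)
    return sum(epanechnikov((x - xi) / h) for xi in samples) / (n * h)

def adaptive_kde(x, samples, h):
    # two-stage locally adaptive estimate: lambda_i = (f(x_i)/g)^(-1/2),
    # with g the geometric mean of the pilot estimates f(x_i)
    pilot = [max(kde(xi, samples, h), 1e-12) for xi in samples]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
    lam = [math.sqrt(g / p) for p in pilot]
    n = len(samples)
    return sum(epanechnikov((x - xi) / (h * li)) / li
               for xi, li in zip(samples, lam)) / (n * h)
```

Sparse samples receive pilot estimates below the geometric mean, hence λ_i > 1 and broader smoothing, which is exactly the behaviour described above.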
3.4.3 Matching procedure
In matching model and image graphs, we seek a match E and a viewpoint transformation T that maximize
P(H | E, T) as defined by equation 3.2. This problem does not appear to have a solution method that is both fast and exact. One could, for example, enumerate all possible E and, for each one, directly solve for an optimal T. This approach is guaranteed to find the truly optimal solution, but because its complexity is roughly exponential in the number of model or image features, it is impractical. To shorten the search for a solution considerably, we use two simplifications:
(a) The influence of P(T | H) is ignored during the search. Instead, all viewpoint transformations are considered to be equally likely.
(b) We use an iterative alignment method that alternates between extending E and refining T .
[Figure 3.3. Example of a locally-adaptive probability density estimate for attribute vectors. The plot shows probability density as a function of the angles of two junctions; the spikes denote the samples from which the density function was estimated.]
The iterative alignment method was used by Lowe (1987a, b) in his SCERPO system. It works as follows. A minimal set of feature matches is first hypothesized and used to estimate a viewpoint transformation. This transformation is then used to predict the positions of the remaining model features. For each of these projected model features, potential matches with nearby image features are identified and ranked according to likelihood (how well the two features match) and ambiguity (whether alternative matches appear nearly as good). The best ranked matches are then adopted and used to refine the estimate of the viewpoint transformation. The process is repeated until acceptable matches have been found for as many of the model features as possible. Although backtracking can be used to try alternate matches for projected features, Lowe reports that this is seldom necessary because the ranking scheme is usually successful in eliminating ambiguous matches, and because errors in the initial estimate of the viewpoint transformation are eliminated as additional matches are incorporated.
In adopting the two simplifications listed above, we are no longer assured that the match solution found will be a truly optimal one. However, except in rare circumstances (as when P(T | H) is especially non-uniform), we expect to find a solution that is very close to optimal, as Lowe has found that the iterative alignment method is accurate enough for reliable recognition.
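The alternation between extending E and refining T can be illustrated with a deliberately simplified Python sketch: the transformation is restricted to a pure translation and features are bare 2-D points, whereas the actual method estimates a full viewpoint transformation and ranks candidate matches by likelihood and ambiguity. All names and the `radius` threshold are illustrative.

```python
def estimate_translation(matches):
    # least-squares transform restricted to a pure translation: the mean
    # offset from image coordinates to model coordinates
    n = len(matches)
    tx = sum(mx - ix for (ix, iy), (mx, my) in matches) / n
    ty = sum(my - iy for (ix, iy), (mx, my) in matches) / n
    return tx, ty

def iterative_alignment(model_pts, image_pts, seed_matches, radius=0.5, iters=5):
    # alternate between refining T from the current matches and extending
    # the match set with image features that project near a model feature
    matches = list(seed_matches)
    for _ in range(iters):
        tx, ty = estimate_translation(matches)
        matches = []
        for ix, iy in image_pts:
            px, py = ix + tx, iy + ty  # project into model coordinates
            best = min(model_pts, key=lambda m: (m[0] - px) ** 2 + (m[1] - py) ** 2)
            if (best[0] - px) ** 2 + (best[1] - py) ** 2 < radius ** 2:
                matches.append(((ix, iy), best))
    return (tx, ty), matches
```

Starting from a single hypothesized seed match, each pass re-estimates the transformation and then adopts every unambiguous nearby correspondence, mirroring the extend/refine cycle described above.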
Ranking feature matches
An initial estimate of the viewpoint transformation can be obtained once one or two feature matches have been hypothesized. However, possibilities for these initial feature matches must be evaluated without regard to a viewpoint transformation or the positions of the features. We use the following function to evaluate a potential match between model feature j and image feature k, using the terms developed in section 3.4.1:
g_j(k) = P(e_j = k | H) / P(e_j = k)
       = [P(a_j = a_k | e_j ≠ ∅, H) / P(a_j = a_k | e_j ≠ ∅)] · [P(e_j ≠ ∅ | H) / P(e_j ≠ ∅)].
Subsequent feature match possibilities are evaluated according to the partial correspondence E and
estimated viewpoint transformation T obtained from earlier matches. The function for evaluating these
potential matches is
g_j(k; E, T) = P(e_j = k | T, H) / P(e_j = k)   if ⟨j, k⟩ is consistent with E,
g_j(k; E, T) = 0                                otherwise.
For a feature match to receive consideration, it must be consistent with all of the matches adopted up to that point. In general this means that the feature must not already have been matched, and any relations shared with features that are matched must be consistent. However, to cope with some common shortcomings in feature detection, these restrictions are relaxed for certain types of features. For example, a model's line segment feature is permitted to match multiple, collinear image segments so that a failure to detect a few edge pixels will be of minimal consequence.
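Given precomputed probability terms, evaluating the initial ranking function g_j(k) reduces to a product of two evidence ratios: how much more likely the attribute agreement and the match itself are when the object is present than in the milieu at large. A minimal Python sketch, with all four probabilities assumed supplied by the estimators of section 3.4.1:

```python
def initial_rank(p_attr_given_h, p_attr, p_match_given_h, p_match):
    # g_j(k): ratio of attribute evidence times ratio of match-frequency
    # evidence; values above 1 favour the match hypothesis
    return (p_attr_given_h / p_attr) * (p_match_given_h / p_match)
```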
Estimating the viewpoint transformation
During the match search, we need to estimate a viewpoint transformation from a set of feature matches. In our formulation, the optimal viewpoint transformation consistent with a set of feature matches can be found as the solution to a weighted least-squares estimation problem. We find a transformation vector t that minimizes ||W(A t − b)||², where the matrix A represents the positions of the matched image features in image coordinates, the vector b represents the positions of the matched model features in model coordinates, and the matrix W contains weights based on the covariances of model feature position coordinates. These objects are defined by
A =
| 1 0 x_1 −y_1 |
| 0 1 y_1  x_1 |
| 0 0 u_1 −v_1 |
| 0 0 v_1  u_1 |
|      ⋮       |
| 1 0 x_n −y_n |
| 0 1 y_n  x_n |
| 0 0 u_n −v_n |
| 0 0 v_n  u_n |

b = [b_1 … b_n]ᵀ, and C = WᵀW = diag(Σ_1⁻¹, …, Σ_n⁻¹),

where Σ_i is the full 4×4 covariance matrix for the sequence of four-element position vectors, B_j, recorded with the model feature participating in the i-th match, e_i = ⟨j, k⟩. (When early in the learning process B_j contains only a single position vector, Σ_i is set to some reasonable default.) The form of C accords with our assumption that feature positions are independently distributed with respect to the object. The solution to the least-squares estimation problem, t, and its covariance P are given by

(AᵀC A) t = AᵀC b  and  P = (AᵀC A)⁻¹.
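A minimal version of this estimation can be sketched in pure Python by forming and solving the normal equations. The sketch simplifies the proposal's formulation: C is taken as the identity (equal weights), and only the point-position rows [1 0 x −y] and [0 1 y x] are used per match, so t = (t_x, t_y, a, b) parameterizes a 2-D similarity transform.

```python
def solve_linear(M, v):
    # Gaussian elimination with partial pivoting for the 4x4 normal equations
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def fit_similarity(image_pts, model_pts):
    # accumulate A^T A and A^T b directly, using two rows per matched point:
    # [1, 0, x, -y] -> model x-coordinate, [0, 1, y, x] -> model y-coordinate
    ATA = [[0.0] * 4 for _ in range(4)]
    ATb = [0.0] * 4
    for (x, y), (bx, by) in zip(image_pts, model_pts):
        for row, target in (([1.0, 0.0, x, -y], bx), ([0.0, 1.0, y, x], by)):
            for i in range(4):
                ATb[i] += row[i] * target
                for j in range(4):
                    ATA[i][j] += row[i] * row[j]
    return solve_linear(ATA, ATb)  # t = (t_x, t_y, a, b)
```

Reintroducing the weights amounts to accumulating AᵀCA and AᵀCb instead, with each match's rows scaled by Σ_i⁻¹.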
During the match search, a recursive least-squares estimation procedure can be used to update t at minimal
cost each time a new feature match is added to E .
3.4.4 Verification procedure
Once an optimal match has been found between a model graph and an image graph, it must be decided whether the match represents an instance of the modeled object in the image. At this time it is difficult to say how sophisticated this verification procedure must be if it is to be reliable enough for learning and recognition. Our proposal is to begin with a simple approach and elaborate it only as necessary. That simple approach, and some of the ways in which it may be enhanced, are outlined in this section.
The match quality measure used to guide matching provides an estimate of the probability that the match result is correct. A simple way to verify the match, then, is to require that this probability exceed some threshold whose value is a system parameter. The parameter essentially determines the degree of caution the system is to exercise in declaring recognition results; it can be set to the ratio of the costs of Type II and Type I decision errors to produce decisions that minimize the expected error cost.
However, the match quality measure may prove unsuitable for making verification decisions because it may differ too widely among different objects according to what high-level features they have. As explained in section 3.4.1, high-level features that represent groupings of low-level ones violate the feature independence assumption. Consequently, the match quality measure is biased by an amount that depends on what high-level features are present in the model. Whereas this bias may have little effect on the outcome of matching any one model graph, it could make it difficult to compare matches among different model graphs. If this is the case, it may be better to consider only the lowest-level features, edgels, when using the measure as a probability estimate for a verification decision.
To recognize more than one object in a scene we must accept more than one match for a given image graph. This will entail choosing among competing matches that provide alternative interpretations for some of the same image features. However, care must be used when determining whether two matches' interpretations conflict over a feature: different objects may be permitted to share an edge segment, for example, but not a feature denoting a closed group of edge segments. Once it has been determined which pairs of match interpretations can coexist, we can obtain a complete interpretation of the scene by selecting a subset of matches that is both pairwise consistent and maximizes the overall match quality measure. An iterative, greedy algorithm should be adequate for choosing that subset in all but the most elaborate and confusing scenes.
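Such a greedy selection can be sketched in a few lines of Python; the `quality` and `consistent` functions are assumed supplied by the match quality measure and the feature-sharing rules above, and are hypothetical here.

```python
def select_matches(matches, quality, consistent):
    # greedily accept the best-quality match that is pairwise consistent
    # with every match accepted so far
    chosen = []
    for m in sorted(matches, key=quality, reverse=True):
        if all(consistent(m, c) for c in chosen):
            chosen.append(m)
    return chosen
```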
3.5 Model learning procedure
With each new training image, a model's description is updated by a model learning procedure. This procedure comprises three activities:
clustering divides training images among clusters corresponding to characteristic views
generalizing combines the training images of a single cluster to form a model graph
consolidating simplifies a model graph while retaining essential information
When a new training image is received the learning procedure must decide whether to assign it to one of the
existing clusters, or whether to form a new cluster with it. That decision is made by a clustering algorithm,
which is described below in section 3.5.1. When a training image is assigned to an existing cluster, the generalizing algorithm described in section 3.5.2 merges information from the image into the model graph representing the cluster. Additionally, as a model graph grows in complexity, it is consolidated as described in section 3.5.3 to maintain efficiency.
3.5.1 Clustering training images into characteristic views
An incremental conceptual clustering algorithm is used to create clusters among the training images as they are acquired. Like COBWEB (Fisher 1987), for example, the algorithm requires a measure of global clustering quality, which it uses to guide local clustering decisions. We would like that measure to promote two somewhat-conflicting qualities while striking a balance between them. On one hand, it should favour clusterings that result in simple, concise, and efficient models; on the other hand, it should favour clusterings whose models accurately characterize (or fit) the training data.
Maximum a posteriori (MAP) estimation provides a nice framework for combining these two qualities. Given training data X = {X} for an object, our problem is to choose a model M that best characterizes X. (In our case, choosing M amounts to choosing a clustering.) We choose an M that maximizes P(M | X), which, by Bayes's theorem, also maximizes the product P(M) P(X | M). The terms P(M) and P(X | M) correspond to the simplicity and fit qualities mentioned above.
Model simplicity term
The term P(M) represents, in our case, a prior distribution on clusterings. To favour clusterings that result in simple models, we use the minimum description length principle (Rissanen 1983) to construct this prior distribution. We define an encoding scheme for models that amounts to concisely enumerating the model's graphs, nodes, arcs, attribute vectors, and position vectors with finite precision, using a fixed number of bits to encode each component. Then, with L(M) denoting the length of M's encoding, we use the relationship established by Shannon between probability distributions and minimal-length codings to obtain −log P(M) = L(M), thus defining P(M).
Data fit term
The term P(X | M) measures the fit between model and training data.6 To evaluate it we could estimate the probability of each of the model's training images according to the best match between image graph and model graph, which is found by searching over all characteristic views and feature matchings:

P(X | M) = Π_{X ∈ X} P(X | M),  where  P(X | M) = max_{G ∈ M, E, T} P(H | E, T),    (3.6)

P(H | E, T) is defined by equation 3.2, and the maximum is taken over model graphs G, matchings E between G and X, and transformations T. However, in the course of learning the model each training image has already been assigned to a particular characteristic view and matched to the model graph representing that view. By assuming that the maximum in equation 3.6 is still attained at these same choices of characteristic view and matching, we can avoid any search and instead estimate P(X | M) with a simple product taken over the model graphs, their nodes, and the sequences of attribute vectors and positions recorded by those nodes.
Clustering procedure
For simplicity, each training image is assigned to exactly one cluster. (A procedure that is capable of assigning an image to more than one cluster may require less training data in some cases, but it must solve a problem of much greater complexity.) Clustering decisions are guided by a desire to maximize P(M | X). The most straightforward clustering strategy assigns each training image to a cluster as it is acquired: P(M | X) is estimated for each possible assignment, and the assignment yielding the largest value is adopted.
6 A requirement for a good model is that there be limited covariation among features within each characteristic view (section 3.2.2). It can be shown that maximizing P(X | M) favours such models.
However, this simple strategy may be sensitive to the order in which training examples are presented. The sensitivity may be alleviated by adjusting cluster boundaries, merging clusters, and splitting them in the course of learning. Experimental evaluation is needed to determine to what extent order sensitivity is a problem, and what strategies are effective in overcoming it. Possible strategies include:

periodically re-assigning the training image, X, that has the lowest P(X | M)
with each new training image, considering merging the two clusters best matched by the training image
applying a numeric clustering algorithm to the collections of attribute vectors recorded for certain model features, to identify possible ways of splitting a cluster
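The straightforward incremental assignment step, trying each existing cluster plus a new singleton cluster and keeping whichever scores best, can be sketched as follows. The `score` function stands in for an estimate of P(M | X) and is hypothetical; the proposal's actual measure combines the MDL prior and the data-fit term above.

```python
def best_assignment(image, clusters, score):
    # build one candidate clustering per possible assignment of the image:
    # appended to each existing cluster, or placed in a new cluster
    candidates = [clusters[:i] + [clusters[i] + [image]] + clusters[i + 1:]
                  for i in range(len(clusters))]
    candidates.append(clusters + [[image]])
    # adopt the candidate the score function (a stand-in for P(M | X)) rates highest
    return max(candidates, key=score)
```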
3.5.2 Generalizing training images to form a model graph
Within each cluster, training images are merged to form a single model graph that represents a generalization of those images. An initial model graph is formed from the first training image assigned to a cluster by selecting, from its image graph, the connected subgraph containing the largest-scale token. When a subsequent training image is assigned to the same cluster, an optimal match is found between its image graph and the current model graph. Merging is based on that optimal match. Each model node that matches an image node is updated by adding the image node's attribute vector and position to the sequences recorded with the model node. New nodes are added to the model graph for those image nodes that weren't matched by the model.
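The merging step can be sketched with a dictionary-based model graph in which each node records its sequences of attribute vectors and positions (arcs and relations are omitted). The names and data layout here are illustrative, not the proposal's representation.

```python
def merge_training_image(model, matches, image_features, next_id):
    # model: {node_id: {"attrs": [...], "positions": [...]}}
    # matches: {image_node: model_node, or None if unmatched}
    # image_features: {image_node: (attribute_vector, position)}
    for img_node, (attrs, pos) in image_features.items():
        mdl = matches.get(img_node)
        if mdl is None:
            # an unmatched image node becomes a new model node
            model[next_id] = {"attrs": [attrs], "positions": [pos]}
            next_id += 1
        else:
            # a matched node extends the recorded sequences
            model[mdl]["attrs"].append(attrs)
            model[mdl]["positions"].append(pos)
    return next_id
```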
3.5.3 Consolidating a model graph
As information accumulates during learning, it may reveal ways in which a model may be simplified to
achieve greater efficiency. The system periodically attempts the following simplifications for each evolving
model graph:
Rare features are eliminated. The model is pruned of features whose likelihood of being detected, P(e_j ≠ ∅ | H), has dropped below some threshold.
Attribute vector distributions are consolidated. The time and space complexity of the non-parametric density estimators used to approximate attribute vector distributions increase linearly with the number of training samples. Thus, to improve efficiency, a distribution represented by a large number of samples is replaced with a parametric estimator, provided that the parametric estimator approximates the sampled distribution well enough. A goodness-of-fit test determines when the substitution can be made.
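The first simplification reduces, with equation 3.3, to a threshold test. In this illustrative sketch m_j is taken as the number of positions recorded with the feature (one per training image in which it matched), and the model layout is the hypothetical dictionary form used above.

```python
def prune_rare_features(model, m, threshold=0.1):
    # keep only features whose detection-probability estimate
    # (m_j + 1) / (m + 2), equation 3.3, reaches the threshold
    return {j: node for j, node in model.items()
            if (len(node["positions"]) + 1) / (m + 2) >= threshold}
```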
3.6 Summary
To handle variation in object appearance this approach relies on finding an abundance of features for describing that appearance. In practice this should work because most objects have many local features that can be used to identify them. An image is described in terms of features chosen for stability and significance; by retaining some redundancy among those features, we obtain a collection whose stability exceeds that of its individual members.
The model represents some information about how an object has been observed to vary in appearance. Because this information is recorded independently for each feature, the matching and learning procedures remain simple and efficient. However, for the s