
    Ph.D. Research Proposal:
    Learning Object Recognition Models from Images

    Art Pope

    Advisor: David Lowe

    Department of Computer Science
    University of British Columbia

    January 1994

    Abstract

    To recognize an object in an image one must have an internal model of how that object may appear. This proposal describes a method for learning such models from training images. The proposed method models an object by a probability distribution that describes the range of possible variation in the object's appearance. This probability distribution is organized on two levels. Large variations are handled by partitioning the training images into clusters that may correspond to distinctly different views of the object. Within each cluster, smaller variations are represented by probability distributions that characterize the presence, position, and measurements of various discrete features of appearance. The learning process combines an incremental conceptual clustering algorithm for forming the clusters with a generalization algorithm for consolidating each cluster's training images into a single description. Recognition, which involves matching the features of a training image with those of a cluster's description, can employ information about feature positions, numeric measurements, and relations in order to constrain and speed the search. Preliminary experiments have been conducted with a system that implements some aspects of the proposed method; the system can learn to recognize a single characteristic view of an object in the presence of occlusion and clutter.


    Contents

    1. Introduction
    2. Related Research
       2.1 Representation
           2.1.1 Primitives
           2.1.2 Coordinate system
           2.1.3 Organization
           2.1.4 Representing object models
       2.2 Recognition
           2.2.1 Perceptual organization
           2.2.2 Indexing
           2.2.3 Matching
           2.2.4 Verifying
       2.3 Learning to recognize objects
           2.3.1 Learning an appearance representation
           2.3.2 Learning an appearance classifier
           2.3.3 Learning a structural model
           2.3.4 Learning a set of characteristic views
           2.3.5 Learning a recognition strategy
           2.3.6 Theoretical learnability
    3. Approach
       3.1 Image representation
           3.1.1 Concepts
           3.1.2 Implementation
       3.2 Model representation
           3.2.1 Use of characteristic views
           3.2.2 Representation of a characteristic view
       3.3 Coordinate systems
       3.4 Recognition procedure
           3.4.1 Match quality measure
           3.4.2 Probability density estimation
           3.4.3 Matching procedure
           3.4.4 Verification procedure
       3.5 Model learning procedure
           3.5.1 Clustering training images into characteristic views
           3.5.2 Generalizing training images to form a model graph
           3.5.3 Consolidating a model graph
       3.6 Summary
    4. Empirical Evaluation
       4.1 Preliminary results
       4.2 Analysis
           4.2.1 Better low-level features and verification needed
           4.2.2 Better high-level features needed
           4.2.3 Position information used more effectively to constrain matching
       4.3 Proposed evaluation
    5. Plan
    References

    1. Introduction

    To recognize an object in an image one must have some expectation of how the object may appear. That expectation is based on an internal model of the object's form or appearance. The research proposed here seeks to determine how one might construct a system that can acquire such models of objects directly from intensity images, and then recognize those objects using the models thus acquired. This system would be capable of learning to recognize objects.

    The following scenario illustrates how this recognition learning system would operate. One presents to the system a series of example images, each depicting a particular object from various viewpoints. From those examples the system develops a model of the object's appearance. This training is repeated for each of the objects the system is to recognize. When presented with a test image, the system can then identify and rank apparent instances of the known objects in the image.

    Few object recognition systems have been designed to acquire their models directly from intensity images. Instead, most systems are simply given models in the form of manually-produced shape descriptions. During recognition, in order to determine whether some object can have a particular appearance, the object's shape description must be combined with a model of the imaging process that accounts for the sensor, projection, surface reflectance, and illumination. Therefore, successful recognition depends on having good models not only of the objects, but also of the image formation process. Because modeling that process has proven difficult, most object recognition systems have been restricted to object models that are relatively simple and coarse.

    A system that can learn its models directly from images, on the other hand, will enjoy three important advantages.

    The learning system will avoid the problem of having to estimate the actual appearance of an object from some idealized model of its shape. Instead, all of the properties of the object's appearance needed for recognition will be learned directly by observation. Eliminating the inaccuracies of appearance estimation may allow recognition to be accomplished more accurately and robustly.

    The learning system will acquire new models more conveniently. A new model will be defined not by measuring and encoding its shape, as with traditional object recognition systems, but by merely displaying the object to the system in various poses.

    The learning system could provide a robot with the ability to learn objects when they are first encountered and then recognize those objects elsewhere. This ability would be important for a robot roaming in a dynamic and unknown environment.

    The difficulty of the recognition learning problem is largely due to the simple fact that an object's appearance can vary substantially. It varies with changes in camera position and lighting and, if the object is flexible, with changes in shape. Further variation is due to noise, and to differences among individual instances of the same type of object. Accommodating this variation will be a central theme in the design of a recognition learning system. The scheme used to represent image content must be stable so that whenever variation is small, changes in representation are likely to be small also. The scheme used to model an object's appearance must describe just what variations are possible. The learning procedure must generalize enough to overcome insignificant variation, but not so much as to confuse dissimilar objects. And the procedure used to identify modeled objects in images must tolerate the likely range of mismatch between model and image.

    In investigating the recognition learning problem I propose to concentrate on one particular version of it: learning to recognize 3D objects in 2D intensity images. Objects will be recognized solely by the intensity edges they produce (although nothing about the approach I propose here seems to preclude extending it to use other properties too, such as colour and texture). As for the objects themselves, they may be entirely rigid, composed of a small number of articulations, or somewhat flexible. The system should be able to learn to recognize a class of similar objects, accommodating the variation among objects just as it accommodates the variation among images of a single object. This formulation of the recognition learning problem seems general enough to involve all important aspects of the problem and yield a genuinely useful system, yet restrictive enough to permit a solution to be found within the limits of a Ph.D. program.

    The proposed method models an object by a probability distribution that describes the range of possible variation in the object's appearance. This probability distribution is organized on two levels. Large variations are handled by partitioning the training images into clusters that may correspond to distinctly different views of the object, or to different configurations of its parts. Within each cluster, smaller variations are represented by probability distributions characterizing various discrete appearance features. For each feature we represent the probability of detecting that feature in various positions with respect to the overall object, and the probability of the feature having various values of feature-specific, numeric measurements. A rich, partially-redundant, and extensible repertoire of features is used to describe appearance.

    The proposed learning method combines an incremental conceptual clustering algorithm for forming the clusters with a generalization algorithm for consolidating each cluster's training images into a single description. Recognition, which involves matching the features of a training image with those of a cluster's description, can employ information about feature positions, numeric measurements, and relations in order to constrain and speed the search for a match. Preliminary experiments have been conducted with a system that implements some aspects of the proposed method; that system can learn to recognize a single view of an object among other occluding and distracting objects.

    In this introduction I have described the problem to be considered, its significance, the source of its difficulty, and the outline of a solution. Chapter 2 discusses relevant work by others on this and similar problems. Chapter 3 describes the approach I propose for solving the problem. Some aspects of the approach have already been implemented and tested, as reported in chapter 4. Chapter 5 presents a plan for completing the investigation.

    2. Related Research

    Before reviewing related research on learning to recognize objects we will survey some more general aspects of object recognition. The general survey is organized as two sections: the first on the representation of object models and images, and the second on finding suitable matches between them. These two sections cover a large subject rather quickly while focusing on topics particularly relevant to the research proposed here. A more thorough review of object recognition has been prepared separately (Pope 1994).

    The two sections of general survey are followed by a third specifically on the use of learning in object recognition. Since relatively little work has been done on this aspect of object recognition, the third section can cover its topic in somewhat greater detail than the preceding two, presenting a fairly complete survey of the role of learning in object recognition.

    2.1 Representation

    Representation schemes are needed both for object models and for image contents. Since these two schemes must be closely related to facilitate matching, much of the discussion that follows pertains to both.

    2.1.1 Primitives

    Various criteria favour a representation that employs localized primitives having a variety of types and scales. These primitives should be tailored to their application in two respects. First, they should make explicit any information needed to recognize and distinguish objects. Second, they must employ natural regularities (e.g., smoothness) to achieve stability. Commonly used for representing the shape of 2D intensity edges are segments of algebraic curves (including straight lines as well as higher-order curves, such as ellipses and general conics); parts or regions (e.g., ribbons formed by the projection of generalized cylinders); and various distinguished features such as significant changes in curvature or an application-specific set (e.g., eye, nose, and mouth templates).

    For recognition it is desirable to use primitives that are somewhat invariant with respect to changes in pose, lighting, and other confounding factors. Since it has been shown by Clemens and Jacobs (1991), among others, that there is no general invariant suitable for recognizing 3D objects in 2D images (i.e., no property of a general arrangement of 3D points that is invariant under projectivities), attention has focused on finding approximate invariants (e.g., Ben-Arie 1990), and invariants for suitably constrained structures (e.g., coplanar sets of points; Lamdan, Schwartz, and Wolfson 1988).

    In most recognition work the repertoire of primitives has been finite and predetermined. However, achieving good recognition performance across numerous objects will probably require some ability to learn new types of primitives as needed (i.e., feature construction). A preliminary example of such learning has been given by Segen (1989), whose system learns primitives that represent useful configurations of distinguished points (see section 2.3.1). Also, whereas objects have usually been described by a relatively small number of primitives (perhaps a few dozen), recognition under widely varying conditions will require the use of much richer representations that have useful levels of redundancy.

    2.1.2 Coordinate system

    A scheme for representing object shape may use a coordinate system that is either object-centred (explicitly recording 3D structure) or viewer-centred (recording a series of 2D views, or characteristic views). Both have been used extensively for recognition. An object-centred representation will describe an object more accurately and concisely, but the description cannot be generated easily using intensity images acquired from unknown viewpoints. Instead, a viewer-centred description can be created by clustering views, as some researchers have done when creating viewer-centred descriptions from object-centred ones (e.g., Pathak and Camps 1993). The viewer-centred representation's accuracy is enhanced, and its storage requirements reduced, by using primitives that remain relatively stable over a range of viewpoints.

    2.1.3 Organization

    A shape representation will often organize primitives according to some combination of part/whole relations, degrees of descriptive accuracy, and adjacency relations. These organizations are frequently formalized as attributed graph structures in which nodes denote primitives, arcs denote their relations, and both nodes and arcs have attributes that record measurements (e.g., Wong, Lu, and Rioux 1989).

    2.1.4 Representing object models

    Modeling an object that has a varying or uncertain appearance (e.g., a flexible object) requires some way of representing the possible range of variation. For articulate objects, one approach has been to decompose the object into parts whose relative positions are described by free variables having specified ranges (e.g., Brooks 1981). More general deformations can be modeled by an attributed graph in which attributes are characterized by probability distributions (e.g., Wong and You 1985). In addition, an object model may include information about the reliability of various object features and the expected cost of detecting them in images (e.g., Kuno, Okamoto, and Okada 1991). These factors can be used to evaluate evidence of the object's presence in an image, and to optimize the strategy for collecting that evidence. It has proven difficult to estimate, from analytic models of lighting, sensor, and surface characteristics, such things as the reliability and variation of various object features (Chen and Mulgaonkar 1992). Methods that learn object models directly from images seek to bypass that difficulty.

    2.2 Recognition

    Recognizing an object in an image is generally accomplished in four steps: perceptual organization, in which interesting arrangements of features are identified in the image; indexing, in which likely candidates are selected from a library of models; matching, in which an optimal match is found between features of the image and those of a model; and verification, in which a decision is made about whether the optimal match represents an instance of the object in the image. These four steps are considered individually in the following subsections. An important research goal has been to minimize search, which is substantially involved in each step of the recognition process.
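    The division of labour among these four steps can be made concrete with a small sketch. The Python fragment below is only a schematic illustration of the pipeline's structure; the grouping rule, the indexing test, the match score, and the verification threshold are arbitrary placeholders rather than the method of any system cited in this survey.

        from itertools import combinations

        def perceptual_organization(features, radius=20.0):
            # Stage 1: group nearby low-level features into larger ones (placeholder proximity rule).
            # Each feature is a dict with "x", "y", and "type" entries.
            return [(a, b) for a, b in combinations(features, 2)
                    if abs(a["x"] - b["x"]) + abs(a["y"] - b["y"]) < radius]

        def index(groups, model_library):
            # Stage 2: select candidate models sharing at least one feature type with the image.
            # model_library: list of dicts, each with a "name" and a set of feature "types".
            observed = {f["type"] for pair in groups for f in pair}
            return [m for m in model_library if m["types"] & observed]

        def match(groups, model):
            # Stage 3: score a candidate by the fraction of its feature types seen (placeholder measure).
            observed = {f["type"] for pair in groups for f in pair}
            return len(model["types"] & observed) / max(len(model["types"]), 1)

        def verify(score, threshold=0.5):
            # Stage 4: accept a match only if its score is high enough.
            return score >= threshold

        def recognize(features, model_library):
            groups = perceptual_organization(features)
            hypotheses = [(m["name"], match(groups, m)) for m in index(groups, model_library)]
            return sorted([h for h in hypotheses if verify(h[1])], key=lambda h: -h[1])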

    2.2.1 Perceptual organization

    This step takes simple image features (e.g., edgels or straight lines) and groups them to form larger features that are more descriptive and, consequently, more useful for indexing. Grouping is based not on knowledge of specific objects, but rather on general assumptions that are meant to hold for most objects and situations encountered, e.g., the assumption that most objects are composed of a small number of convex parts (Jacobs 1989). The goal is to form groups that are likely to come from a single object, yet unlikely to arise accidentally due to a chance arrangement of objects (Lowe 1985). To be useful for recognition, however, a grouping method need only be partially successful since spurious groups will be eliminated by subsequent stages. With some notable exceptions (Lowe 1985; Sarkar and Boyer 1993), grouping methods are usually ad hoc, controlled by numerous hand-picked parameters, and entirely bottom-up (e.g., Bergevin and Levine 1992).

    2.2.2 Indexing

    Features found in the image are used to identify likely models, and perhaps also starting points for matching those models to the image. Frequently used are hash tables, which identify model hypotheses from hashed feature measurements (e.g., Lamdan, Schwartz, and Wolfson 1988). Radial basis functions have been used to map feature measurements to a series of graded yes/no responses, one per model (Brunelli and Poggio 1991); they do not appear to provide any advantage over hash tables, however. Hierarchical indices, which organize models according to an abstraction hierarchy, have been studied by few researchers (e.g., Sengupta and Boyer 1993) but they appear promising for reducing overall search.
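    As a concrete illustration of hash-table indexing, the sketch below quantizes feature measurement vectors into hash keys; each model's features populate the table off-line, and image features then vote for candidate models at recognition time. The quantization width and the vote threshold are assumptions made for illustration, not parameters reported by the systems cited above.

        from collections import defaultdict

        BIN_WIDTH = 0.1   # assumed quantization width for feature measurements

        def hash_key(measurements):
            return tuple(round(m / BIN_WIDTH) for m in measurements)

        def build_index(model_features):
            # model_features: dict mapping model name -> list of feature measurement vectors.
            table = defaultdict(set)
            for name, vectors in model_features.items():
                for v in vectors:
                    table[hash_key(v)].add(name)
            return table

        def candidate_models(table, image_vectors, min_votes=3):
            # Image feature measurements vote for the models whose features hash to the same bin.
            votes = defaultdict(int)
            for v in image_vectors:
                for name in table.get(hash_key(v), ()):
                    votes[name] += 1
            return sorted((n for n, c in votes.items() if c >= min_votes),
                          key=lambda n: -votes[n])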

    2.2.3 Matching

    The goal is to find a match between image features and model features; that match must correspond to some possible view of the modeled object and be, in some sense, optimal. How one measures match quality determines not only what matches will be preferred, but also what match algorithm can be used. The fastest matching algorithm reported so far, Breuel's RAST method (1992), is analogous to a Hough transform in that it searches a transformation space. RAST uses a relatively simple measure that assumes a rigid object, requires corresponding features to fall within some fixed range of each other, and weights each feature correspondence equally. However, it may be difficult to extend RAST to handle articulate or flexible objects, or models whose features vary in importance.

    Other methods search the space of possible pairings between model and image features. These correspondence space methods (e.g., Grimson 1990) can easily incorporate a variety of matching constraints and more powerful match quality measures, but they are time-consuming: they require time exponential in the number of model or image features. Correspondence space searches can be shortened by using a branch-and-bound technique to prune the search tree, or by accepting near-optimal solutions. Alignment methods do the latter and have proven to be quite effective (Ayache and Faugeras 1986; Lowe 1987a, b; Ullman 1989). They match the first few features with a correspondence space search, use those matches to estimate the pose of the object, and then locate additional matches within the constraints provided by the pose estimate.

    For an efficient correspondence space search, some attention must be given to the order in which features are matched. At each stage one should choose a feature that is likely to be found in the image when the object is present, is unlikely to arise by chance, is relatively unique among model features, is well localized by previous matches, and can effectively constrain the remaining matches. Some have described methods of quantifying these factors so that features are considered in an optimal order (e.g., Kuno, Okamoto, and Okada 1991). Another method of ordering the search is to structure it hierarchically, first recognizing generic object parts, and then recognizing objects in terms of those parts (Burns and Riseman 1992; Dickinson, Pentland, and Rosenfeld 1992).

    2.2.4 Verifying

    Once a match has been identified, it must be decided whether that match actually represents an instance of an object in the image. Often this is done by projecting all model edges into the image and searching for image edges near them; parallel image edges are counted as positive evidence, and crossing ones as negative evidence (e.g., Huttenlocher and Ullman 1990). More formally, one can use an analytic model of feature occurrence to estimate the probability that a particular match is accidental (i.e., due to a conspiracy of accidental features), and accept the match only if this probability is tiny (e.g., Sarachik and Grimson 1993). Breuel (1993) has suggested that a model of scene structure be used in evaluating the match, to judge whether missing features can be explained away by a plausible series of occlusions.

    2.3 Learning to recognize objects

    There are several possible roles for learning in an object recognition system. These are surveyed in the following sections. A final section summarizes efforts to establish theoretical bounds on the learnability of object recognition.

    2.3.1 Learning an appearance representation

    An object recognition system may learn new types of shape or appearance primitives to use in describing objects. Typically this is done by clustering existing primitives or groups of them, and then associating new primitives with the clusters that have been found. New primitives thus represent particular configurations or abstractions of existing ones. The new configurations may improve the representation's descriptive power, and the new abstractions may allow more appropriate generalizations.

    Segen (1989) has demonstrated this approach with a system that learns a representation for 2D contours. The system's lowest-level primitives are distinguished points, such as curvature extrema, found on contours in training images. Nearby points are paired, each pair is characterized by a vector of measurements, the measurement vectors are clustered, and a new primitive is invented for each cluster of significant size. Consequently, each new primitive describes a commonly observed configuration of two distinguished points. The induction process is repeated with these new primitives to generate higher-level primitives describing groups of four, eight, and more distinguished points. One would expect this method to be sensitive to clutter in the training images and to parameters of the clustering algorithm; unfortunately, Segen has not reported on that sensitivity.

    Two other examples of representation learning will round out this portion of the survey. Delanoy, Verly, and Dudgeon (1992) and Fichera et al. (1992) have described object recognition systems that induce fuzzy predicates from training images. Their systems represent models as logical formulae, and therefore the systems need appropriate predicates. These are invented by clustering measurements of low-level primitives that have been recovered from training images. Turk and Pentland (1991) and Murase and Nayar (1993) have induced features using principal component analysis. They compute the several most significant eigenvectors, or principal components, of the set of training images. Since these few eigenvectors span much of the subspace containing the training images, they can be used to concisely describe those images and others like them.
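    A minimal sketch of this kind of feature induction by principal component analysis follows: training images, flattened into vectors, are centred and decomposed with a singular value decomposition, and a new image is then described by its coordinates in the subspace spanned by the top few components. The image sizes and the number of components retained are arbitrary choices made for illustration.

        import numpy as np

        def learn_subspace(images, k=8):
            # images: array of shape (n_images, n_pixels), one flattened training image per row.
            mean = images.mean(axis=0)
            centered = images - mean
            # Rows of vt are the principal components (eigenvectors of the image covariance).
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            return mean, vt[:k]

        def describe(image, mean, components):
            # Concise description of an image: its projection onto the learned components.
            return components @ (image - mean)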

    2.3.2 Learning an appearance classifier

    The object recognition task can be characterized, in part, as a classification problem: instances represented as vectors, sets, or structures must be classified into categories corresponding to various objects. A wealth of techniques has been developed for classifying, and for inducing classifiers from training examples. For the purposes of object recognition, the important considerations distinguishing these techniques include the expressiveness and complexity of the input representation (e.g., vectors are easier to classify than structures), the generality of the categories learned, the ability to cope with noisy features, the number of training examples needed, and the sensitivity to the order in which examples are presented.

    Jain and Hoffman (1988) describe a system that learns rules for classifying objects in range images. The instances classified by their system are sets of shape primitives with associated measurements. The classifier applies a series of rules, each contributing evidence for or against various classifications. Each rule applies to a particular type of shape primitive and a particular range of measurements for that primitive. These rules are learned from training images by extracting primitives from the images, clustering them according to their measurements, and associating rules with the clusters that derive primarily from a single object. Because this system does not learn constraints governing the relative positions of the shape primitives, it appears to have a very limited ability to distinguish among objects that have different arrangements of similar features.

    Neural networks, including radial basis function networks, have been used as trainable classifiers for object recognition (e.g., Brunelli and Poggio 1991). In this role, the network approximates a function that maps a vector of feature measurements to an object identifier, or to a vector of graded yes/no responses, one per object. For this approach to succeed the function must be smooth; furthermore, as with any classifier, the object categories must be shaped appropriately in feature space (e.g., Gaussian radial basis functions are best suited for representing hyperellipsoids). Nearest neighbour classifiers, which make fewer assumptions, are also commonly used. Comparative evaluations of how various classifier and feature space combinations perform on object recognition problems (such as Brunelli and Poggio 1993) remain rare.
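    To make the radial basis function approach concrete, the sketch below uses every training measurement vector as a Gaussian centre and fits, by least squares, output weights that map the basis responses to one graded yes/no score per object. Treating every training vector as a centre and using a single fixed width are simplifying assumptions, not details of the cited systems.

        import numpy as np

        def train_rbf(X, labels, n_objects, width=1.0):
            # X: (n, d) training measurement vectors; labels: integer object identifiers in [0, n_objects).
            X = np.asarray(X, dtype=float)
            sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            Phi = np.exp(-sq_dists / (2 * width ** 2))        # basis responses, one column per centre
            targets = np.eye(n_objects)[np.asarray(labels)]    # one graded response per object
            W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
            return X, W

        def classify_rbf(centres, W, query, width=1.0):
            sq = ((centres - np.asarray(query, dtype=float)) ** 2).sum(-1)
            responses = np.exp(-sq / (2 * width ** 2)) @ W
            return int(np.argmax(responses))                   # most strongly indicated object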

    2.3.3 Learning a structural model

    A structural model explicitly represents both shape primitives and their spatial relationships. Its structure is thus analogous to that of the modeled object, and it is often represented as a graph or, equivalently, as a series of predicates. In general, a structural model is learned from training images by first obtaining structural descriptions from each image, and then inducing a generalization covering those descriptions.

    Connell and Brady (1987) have described a system that learns structural models for recognizing 2D objects in intensity images. The system incorporates many interesting ideas. They use graphs to represent the part/whole and adjacency relations among object regions that are described by smoothed local symmetries (ribbon shapes). An attribute of a region, such as its elongation or curvature, is encoded symbolically by the presence or absence of additional graph nodes according to a Gray code. A structural learning procedure forms a model graph from multiple example graphs, most commonly by deleting any nodes not shared by all graphs (the well-known dropping rule for generalization). Similarity between two graphs is measured by a purely syntactic measure: simply by counting the nodes they share. Consequently, this system accords equal importance to all features, and it uses a somewhat arbitrary metric for comparing attribute values.
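    The dropping rule itself is simple enough to state in a few lines. In the sketch below each example graph is reduced to a set of hashable node descriptions, which sidesteps the graph matching a full system must perform; the set intersection and the shared-node count mirror the generalization and similarity measures just described.

        def generalize(example_graphs):
            # Dropping rule: the model keeps only the nodes present in every example.
            model = set(example_graphs[0])
            for example in example_graphs[1:]:
                model &= set(example)
            return model

        def similarity(graph_a, graph_b):
            # Purely syntactic similarity: the number of nodes the two graphs share.
            return len(set(graph_a) & set(graph_b))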

    The approaches described below involve more powerful models that represent probability distributions over graphs. A Markov or independence condition is assumed so that the high-order, joint probability distribution can be approximated by a product of low-order distributions, one per node or arc. The probability of a particular graph instance is then defined according to a partial match between that instance and the model.

    Wong and You (1985) represent a model as a random graph in which nodes represent shape primitives, arcs and hyperarcs represent relations among them, and both have attribute values characterized by discrete probability distributions. An attributed graph (i.e., a random graph's outcome) is treated as just a special case of random graph. An entropy measure is defined on random graphs, and the distance between two random graphs is defined as the increment in entropy that would result from merging the two (the minimal increment over all possible mergings). A random graph is synthesized from examples by repeatedly merging the two nearest graphs. This learning method seems to have been demonstrated only with 2D recognition problems and clean, synthetic images. However, McArthur (1991) has extended the random graph formalism to allow continuous attributes, and he has used it to learn 3D, object-centred models from images with known viewpoints.

    In Segen's system (1989) for recognizing 2D shapes, a graph node represents a relation among shape primitives, which are oriented point features (see section 2.3.1). But instead of specifying a particular relation, a node in the model graph specifies a probability distribution over possible relations. For example, relations involving two features may include one that holds when the features are oriented away from each other, and another that holds when they are oriented in the same direction; a model node may specify that the first relation holds with probability 0.5, and the second with probability 0.3. When a graph instance is matched with the model, the instance's probability can be computed by assessing the probability of each instance node according to the distribution of its corresponding model node. Recognition involves finding the most probable match. A model is learned incrementally from instances by matching it with each one successively; following each match the probabilities recorded in matching model nodes are adjusted and any unmatched instance nodes are added to the model.

    2.3.4 Learning a set of characteristic views

    In learning a model that is to be represented as a set of characteristic views, part of the task is to choose those views. One can cluster the training images (a form of unsupervised learning) to create one characteristic view from each cluster. Gros (1993) and Seibert and Waxman (1992) have done this with real images, while several others have clustered images rendered from CAD models. Gros measures the similarity of an image pair as the proportion of matching shape primitives. Clusters are formed by successively merging nearby images, and then a view-specific model is created from each cluster. Seibert and Waxman use a comparable method; since they are interested in recognizing aircraft in flight, their system also learns a Markov model characterizing transitions between characteristic views.


    2.3.5 Learning a recognition strategy

    Draper (1993) has considered how a system equipped with a variety of special-purpose representations and algorithms might learn strategies for employing those techniques to recognize specific objects. A typical recognition task would be to locate a tree by fitting a parabola to the top of its crown.1 For this task, an appropriate strategy is to segment the image, extract regions that are colored and textured like foliage, group these into larger regions, smooth region boundaries, and fit parabolas to the boundaries. A human supplies training examples by pointing out the desired parabolas in a series of images. The system then evaluates various strategy choices (e.g., whether to smooth region boundaries) and parameter choices (e.g., the degree of smoothing) for cost and effectiveness. From these evaluations it constructs a strategy, including alternatives, for performing the task on new images. By learning what order to try alternative recognition methods, Draper's system differs from those that just select a single set of features useful for recognition.

    2.3.6 Theoretical learnability

    There have been some efforts to identify theoretical limits on what can be learned from images. These efforts are based on the analogy that learning to recognize an object from images is like learning a concept from examples. Valiant (1984) has provided a useful definition for characterizing the class of concepts that can be learned by a particular algorithm: informally, a class is probably approximately correct (PAC) learnable by an algorithm if, with high probability, the algorithm learns a concept that correctly classifies a high proportion of examples using polynomially bounded resources (and, consequently, a polynomially bounded number of training examples); the bound is a polynomial in both the accuracy and some natural parameter of the concept class (e.g., vector length for concepts defined on a vector space). Significantly, the algorithm has no prior knowledge about the distribution of examples.

    Shvayster (1990) has shown that some classes of concepts defined on binary images are not PAC-learnable from positive examples by any algorithm. For example, suppose a template is said to match an image when every black pixel in the template corresponds to a black pixel in the image (though not necessarily vice versa). Then the concept consisting of all instances not matching some unknown template is not PAC-learnable. Shvayster speculates that some nonlearnable concepts may become learnable, however, if some prior knowledge about the distribution of examples is available.

    1 This example is from Draper 1993, pp. 93-119.

    Edelman (1993) has argued that Shvayster's negative result is not applicable to object recognition because it uses an instance representation, the binary image, that is inappropriate. If instead instances are represented by vectors of point feature locations, Edelman shows, then recognition of an object can be learned from a polynomial number of positive examples. He concludes that model learning may be practical, provided an appropriate representation is chosen.


    3. Approach

    This chapter describes an approach to the design of a system that learns to recognize objects. The discussion is organized around four central issues, with one section devoted to each. The four issues are:

    How is an image represented?

    How is a model represented?

    How is recognition accomplished?

    How are models learned from images?

    A fifth issue, how the database of models is organized and accessed, must also be considered in designing a complete object recognition system that is meant to work with a large repertoire of objects. To limit its scope, however, the proposed research will not study this issue in any depth. Instead, the simplest possible organization and access methods will be used for the database of models: a list of models will be kept, and, during recognition, each model will be tested in sequence. In future, this aspect of the system could be improved readily using existing techniques for managing model databases. For example, efficient access to a large database could be maintained by constructing hash table indices, factoring out subparts common to multiple models, and clustering similar models.

    The proposed approach to recognition and learning may be summarized as follows: A model is organized as a set of characteristic views, each representing a probability distribution over collections of local appearance features. An image is represented using similar appearance features. Recognizing an object in an image requires a search for the most probable correspondence between model features and image features. When learning a model, the system uses a clustering algorithm to divide training images among characteristic views; within each view cluster, training images are consolidated into a single description by matching and generalizing them. Features found to correspond in training images provide the samples from which the probability distributions are estimated.
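    A skeleton of that learning cycle, written here only to make the control flow explicit, might look like the following. The similarity measure, the view-creation and merging steps, and the acceptance threshold are placeholders standing in for the procedures the proposal develops in sections 3.4 and 3.5.

        def learn_model(training_graphs, similarity, new_view, merge, threshold=0.5):
            # Cluster training image graphs into characteristic views, consolidating each cluster.
            #   similarity(view, graph) -> score in [0, 1]
            #   new_view(graph)         -> a fresh view model seeded from one image graph
            #   merge(view, graph)      -> generalize an existing view with one more image graph
            views = []
            for graph in training_graphs:
                best_view, best_score = None, 0.0
                for view in views:
                    score = similarity(view, graph)
                    if score > best_score:
                        best_view, best_score = view, score
                if best_view is not None and best_score >= threshold:
                    merge(best_view, graph)        # the image joins an existing characteristic view
                else:
                    views.append(new_view(graph))  # the image starts a new characteristic view
            return views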

    Some aspects of this approach have already been implemented and tested, as reported in chapter 4. The approach described here incorporates lessons learned from that experience.


    3.1 Image representation

    This section begins with an explanation of the general concepts underlying the proposed image representation. This is followed by more specific information on the representation to be implemented, and the method by which descriptions in that representation are derived from images.

    3.1.1 Concepts

    The system represents an image in terms of specific properties, which we call features.2 Each feature has a particular type and location within the image. It may, for example, be a segment of intensity edge, a particular arrangement of such segments, a region of uniform texture, or one of uniform colour. Some low-level features may be defined as being the responses of feature detectors, such as edge or corner detectors; other, higher-level features may be defined as certain groupings or abstractions of the low-level ones. A typical image will be described by numerous features of various types.

    The design of features should not be critical to the performance of the system; what is important is that there be a wide range of features able to provide a rich, somewhat redundant description of any image. Good features are those that can be detected reliably, and are relatively invariant with respect to unimportant changes (for our purposes, those are modest changes in lighting or viewpoint). For efficiency in recognition it is useful to have some highly-selective features that usually occur only in the presence of certain objects. For example, in recognizing manufactured objects, features denoting various geometric arrangements of intensity edges may serve this role well. Some more commonplace features, such as simple edge fragments, should supplement the highly-selective ones so that the overall repertoire can still describe a wide variety of objects, at least at a basic level. Of course, distinctions can only be made if the repertoire includes features that respond to those distinctions.

    A token denotes a feature that has been identified in an image. It records the type of feature, its position within the image, and a vector of numeric attributes that help to characterize the feature. An L junction feature, for example, may have one attribute for the junction's angle and another for the relative lengths of its two segments. Attributes are expressed so that they are invariant with respect to translation, rotation, and scaling of the feature within the image, using, for example, scale-normalized measures (Saund 1992).

    2 Throughout this chapter, the proposed system is described using the present tense rather than the future tense. This is for readability, and to allow the material to be re-used readily in other texts, such as a thesis.

    A feature's position is represented by a point location in the image plane, an orientation, and a scale, all measured according to some convention appropriate for that type of feature. For example, an L junction's location may be recorded as the location of its vertex, its orientation may be that of its bisector, and its scale may be the diameter of the smallest circle enclosing the segments forming the junction. A feature whose orientation is ambiguous requires special treatment. If there are just a few different ways of assigning an orientation to the feature then it is represented by multiple tokens, one per orientation. A straight line segment, for example, is represented by two tokens with opposite orientations. (This duplication simplifies token operations such as comparison and grouping.) If, like a circle, the feature has no well-defined orientation, its token's orientation is left undefined.

    Tokens denoting the features found in an image are organized in a graph structure, which we call an image graph. Graph nodes represent tokens, and directed arcs represent composition and abstraction, as when one feature is found by grouping or abstracting others.

    Formally, an image graph G is denoted by a tuple ⟨F, R⟩, where F is a set of image feature tokens and R is a relation over elements of F. An image feature token f_k ∈ F is a tuple ⟨t_k, a_k, b_k⟩, where t_k is the feature's type, a_k is its attribute vector, and b_k is its position. The domain of a feature's attribute vector depends on the feature's type. Section 3.3, below, describes how positions such as b_k are represented. Finally, a relation in R is a tuple ⟨k, l_1, ..., l_m⟩, indicating that the image feature k was found by grouping or abstracting image features l_1 through l_m.
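    This definition translates directly into a simple data structure, sketched below. The field names are illustrative choices, not the proposal's; positions are kept as the (location, orientation, scale) form described in section 3.3, with components left undefined (None) where a feature has no well-defined orientation or scale.

        from dataclasses import dataclass, field
        from typing import List, Optional, Tuple

        @dataclass
        class FeatureToken:
            type: str                        # t_k: the feature's type, e.g. "L junction"
            attributes: List[float]          # a_k: type-specific numeric attributes
            position: Tuple[float, float, Optional[float], Optional[float]]
            # b_k: (x, y, orientation, scale); orientation or scale may be None (undefined)

        @dataclass
        class ImageGraph:
            tokens: List[FeatureToken] = field(default_factory=list)
            relations: List[Tuple[int, ...]] = field(default_factory=list)
            # each relation (k, l1, ..., lm) records that token k groups or abstracts tokens l1..lm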

    It is important to note that the image graph deliberately retains information that may be redundant. Some researchers, having other purposes in mind, use procedures that retain only the best of competing descriptions, or that discard tokens once they have been grouped to form a new token. Here we retain all tokens because doing so makes the overall representation both more expressive and more stable. Consider, for example, an edge segment in one image that yields both a line token and an arc token. Together the two tokens provide more information about the segment than either does alone. When a second image is acquired from a slightly different viewpoint the same segment may be perceived as being slightly more curved, and it no longer yields the line. A third image may produce the line, but not the arc. But since the first image graph included both the line and arc tokens it differs only partially from either the second or third graph. As this example illustrates, we achieve a more stable representation by permitting some redundancy.

    3.1.2 Implementation

    For a first implementation of the system we have chosen to use features based solely on intensity edges, which provide some degree of illumination and viewpoint invariance. Table 3.1 summarizes these features and their attributes.

    Table 3.1. Types of features used to describe image content.

    Feature type                       | Attributes                                                                     | Position (a)
    edgel                              | (none)                                                                         | L O (b)
    straight line segment              | (none)                                                                         | L O (b)
    elliptical arc segment             | angular extent; eccentricity                                                   | L O S
    line/line junction                 | corner angle                                                                   | L O
    arc/arc junction                   | corner angle; ratio of arc radii                                               | L O S
    line/arc junction                  | corner angle; scale-normalized arc radius                                      | L O S
    junction pair with common line     | sum of corner angles; difference of corner angles                              | L O S
    junction triple with common lines  | angle of central corner; ratio of two interior sides                           | L O S
    parallel lines                     | angle between lines                                                            | L O S
    parallel arcs                      | angle between arcs; angle spanned by arcs; scale-normalized average separation | L O S
    closed convex region               | compactness; eccentricity                                                      | L O (c) S
    ribbon                             | scale-normalized average width                                                 | L O S

    (a) Which feature position components are well-defined. L: location; O: orientation; S: scale.
    (b) Orientation has a two-way ambiguity.
    (c) Orientation is undefined if eccentricity is low.

    The most basic features are edgels: short, fixed-length strands of connected edge pixels that have been identified by an edge detection operator and thinned to single-pixel width. Connected chains of edgels are segmented into straight lines and elliptical arcs.3 The segments are then grouped successively, and in various ways, to form higher-level features. This initial repertoire of features and feature attributes could be expanded, as needed, to allow richer descriptions of appearance.

    The system uses a straightforward grouping method that is entirely bottom-up (i.e., data-driven) and controlled by numerous thresholds and other parameters. Table 3.1 indicates the order in which features are identified. Junctions, for example, are found by exhaustively testing all pairs of segments, measuring each pair's geometry against several thresholds. Regions and ribbons are found by identifying in the image graph particular configurations of low-level features, such as junctions.
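    The sketch below illustrates the flavour of this exhaustive pairwise grouping for one feature type, the line/line junction: every pair of straight segments is tested, and a junction is proposed wherever two segment endpoints nearly coincide. The gap threshold is an arbitrary placeholder, and a real junction test would also compute the corner-angle attribute and the position convention described in section 3.1.1.

        from itertools import combinations
        from math import hypot

        def find_line_junctions(segments, max_gap=4.0):
            # segments: list of ((x1, y1), (x2, y2)) straight line segments.
            junctions = []
            for i, j in combinations(range(len(segments)), 2):
                for p in segments[i]:
                    for q in segments[j]:
                        if hypot(p[0] - q[0], p[1] - q[1]) <= max_gap:
                            vertex = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
                            junctions.append((i, j, vertex))
            return junctions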

    Figure 3.1 shows the features found in one image. As this example demonstrates, the method is far from perfect. Compared to human vision it seems to miss some features and hallucinate others. However, any grouping method is bound to have these faults to some degree. It remains the task of the matching and learning procedures to accommodate any uncertainty inherent in the grouping process. Although more sophisticated grouping methods have been developed recently (e.g., Sarkar and Boyer 1993), this simple method, with some minor improvements, appears adequate for the experiments proposed here.

    3.2 Model representation

    3.2.1 Use of characteristic views

    To fully and accurately describe the range of possible variation in an object's appearance, a model is organized on two levels. Significant variations in appearance are handled by subdividing the model into a set of characteristic views; because these are independent, each can describe an entirely different aspect of the object. Within the characteristic views, smaller variations can be represented since each characteristic view defines a probability distribution over a variety of similar image graphs.

    3 This repertoire of straight lines and elliptical arcs may prove inadequate for representing and recognizing a useful variety of test objects. In that case, the repertoire should be extended by one or two additional types of parameterized curve. Leonardis (1993) shows how to recover straight, parabolic, and elliptical arcs from edge data.

    Figure 3.1. An illustration of the features used to describe an image. The 365 × 512 pixel image (a) yielded 280 edge segments (b), 200 circular arcs and straight lines (c), 248 junctions (d), 409 junction pairs (e), 75 parallel line and arc pairs (f), 21 regions (g), and 14 ribbons (h).

    This is a viewer-centred model, and it is preferred over an object-centred one for several reasons. Most importantly, the viewer-centred model is more easily learned. To learn an object-centred model, one has to recover the 3D location of each feature. That is difficult because the viewpoint of each training image is unknown, there is uncertainty in the measurement of each feature's image location, and there may be errors in matching features among images. To learn a viewer-centred model it is not necessary to recover 3D structure, and difficulties like unknown viewpoints or appreciable measurement uncertainty are more easily managed.

    Each characteristic view describes a range of possible appearances by defining a joint probability distribution over image graphs. Because the space of image graphs is enormous, however, it is not practical to represent or learn this distribution in its most general form. Instead, the joint distribution is approximated: the various features making up the characteristic view are treated as independent so that the joint distribution can be decomposed into a product of marginal distributions, one per feature.4 As we shall see, this approximation greatly simplifies the representation, matching, and learning of models.
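    Written out for a hypothesized correspondence between a characteristic view V and an image graph G, one common way of expressing such an independence approximation is

        P(G \mid V) \;\approx\; \prod_{j \,\in\, \mathrm{matched}} P_j \, p_j(a_j)\, p_j(b_j) \;\; \prod_{j \,\in\, \mathrm{unmatched}} \bigl(1 - P_j\bigr),

    where P_j is the probability of observing model feature j, and p_j(a_j) and p_j(b_j) are its attribute and position densities evaluated at the matched image feature's attribute vector and position (section 3.2.2). This form is only indicative; the proposal's actual match quality measure is developed in section 3.4.1.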

    One consequence of the simplifying approximation is that statistical dependence (association or covariance) among model features cannot be accurately represented in a single characteristic view. We might consider, as an extreme example of such dependence, an object for which two subsets of features are mutually exclusive: only one of the two subsets appears in any image. This association among features cannot be represented by a single characteristic view, with its independent features; it can, however, be represented perfectly by two characteristic views, each containing one of the two subsets.

    In fact, by using a large enough set of characteristic views we can model any object as accurately as we might wish. However, for the sake of efficiency we would prefer to use relatively few views and let each represent a moderate range of possible appearances. Part of the challenge of learning a model, then, is to strike an appropriate balance between the number of characteristic views used, and the accuracy of those views over their ranges. This issue will be revisited when we discuss the model learning procedure in section 3.5.

    4 Similar feature independence assumptions are often used to simplify other computer vision problems.


    3.2.2 Representation of a characteristic view

    A single characteristic view is described by a model graph (figure 3.2). Like an image graph, a model graph has nodes that represent features and arcs that represent composition and abstraction relations among features. Each node records the information needed to estimate three probabilities:

    The probability of observing this feature, or, more precisely, the probability of observing it in an image depicting the characteristic view of the object. The node records the number of times that its feature has been identified in training images, and a count is also kept of all training images used to learn the overall model graph. The probability is estimated from these two numbers.

• Given that this feature is observed, the probability of it having a particular set of attribute values. This is characterized by a probability density function since attribute values are continuous. Prior to learning, little can be assumed about the form of this density function for any particular feature; it may depend on many factors, such as possible deformations of the object, variations in lighting, and the effects that uncertainty in measuring low-level features may have on the attributes of higher-level ones. Consequently, the system uses a non-parametric density estimator that makes relatively few assumptions. The estimator, which is described in section 3.4.2, requires samples drawn at random according to the distribution to be estimated. These samples are attribute vectors that have been acquired from training images and are recorded by the model graph node. As samples from the unknown distribution accumulate, the learning procedure may discover that the distribution can be approximated well by a simple, parametric form (e.g., a multinormal distribution). In that case, an optimization is performed: the sequence of attribute vectors is replaced by a handful of distribution parameters from which probability density estimates can be computed more quickly.

• Given that this feature is observed, the probability of it having a particular position. This is also characterized by a probability density function and estimated from a sequence of samples obtained from training images. The positions of all features in a single model graph are represented using a common, arbitrarily-fixed coordinate system. Each feature's position is modeled as though it were independently distributed with respect to that coordinate system. We approximate this distribution using a normal distribution, with mean and covariance parameters estimated from the sequence of position samples.


[Figure 3.2 appears here. The schematic shows an object model comprising several model graphs; each node records the feature type, the number of feature observations, a sequence of attribute vectors, and a sequence of positions.]

Figure 3.2. Schematic illustration of model representation. An object is modeled by one or more model graphs, each describing a set of features and their relations. A feature is represented by a node recording information supporting probability estimates: the probability of the feature being observed, of it having particular attribute values, and of it having a particular position relative to other features.

Formally, a model graph $G$ is denoted by a tuple $\langle F, R, m \rangle$, where $F$ is a set of model feature tokens, $R$ is a relation over elements of $F$, and $m$ is the number of training images used to produce $G$. A model feature token $f_j \in F$ is a tuple $\langle t_j, m_j, A_j, B_j \rangle$, where $t_j$ is the feature's type, $m_j$ is the number of training images in which the feature was found, and $A_j$ and $B_j$ are the sequences of attribute vectors and positions drawn from those training images. Finally, a relation in $R$ is a tuple $\langle j, l_1, \ldots, l_m \rangle$, indicating that model feature $j$ is a grouping or abstraction of model features $l_1$ through $l_m$.
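To make the representation concrete, the sketch below shows one way these tuples might be held in memory. It is only an illustration: the class and field names are hypothetical, and the proposal does not prescribe any particular implementation.

```python
# Illustrative sketch of the model-graph tuples defined above (hypothetical names).
from dataclasses import dataclass, field
from typing import List, Sequence

@dataclass
class ModelFeature:
    """A model feature token <t_j, m_j, A_j, B_j>."""
    feature_type: str                         # t_j, e.g. "junction" or "ribbon"
    match_count: int = 0                      # m_j: training images in which it was found
    attribute_samples: List[Sequence[float]] = field(default_factory=list)   # A_j
    position_samples: List[Sequence[float]] = field(default_factory=list)    # B_j: [x, y, u, v]

@dataclass
class Relation:
    """A composition/abstraction relation <j, l_1, ..., l_m>."""
    parent: int                               # index of the grouping or abstracting feature
    parts: List[int]                          # indices of the grouped features

@dataclass
class ModelGraph:
    """A model graph <F, R, m> describing one characteristic view."""
    features: List[ModelFeature]              # F
    relations: List[Relation]                 # R
    num_training_images: int = 0              # m
```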

    3.3 Coordinate systems

A feature's position is specified by a location, orientation, and scale expressed in terms of some 2D, Cartesian coordinate system. To locate image features, we use an image coordinate system whose frame is the camera's focal plane. To locate model features, we use a single, arbitrarily-fixed model coordinate system for all features within a model graph.

    Part of the task of matching a model with an image is to determine a viewpoint transformation that brings

    the image and model features into close correspondence. In our case, this viewpoint transformation is a 2D

    similarity transformation from image coordinates to model coordinates.

    By choosing carefully the manner in which positions and transformations are represented we can ensure

    that the transformation may be expressed as a linear operation, thus simplifying the matching process. One

    formulation that achieves this represents the transformation as a vector. (This formulation has been used by


Ayache and Faugeras, 1986, among others.) A transformation, $T$, that is equivalent to a rotation by $\theta$, a scaling by $r$, and a translation by $[t_x \; t_y]^{\mathrm{T}}$ is represented by the vector $\mathbf{t} = [t_x \; t_y \; t_u \; t_v]^{\mathrm{T}}$, where $\theta = \arctan(t_v / t_u)$ and $r = \sqrt{t_u^2 + t_v^2}$. Similarly, a feature with orientation $\theta$, scale $s$, and location $[x \; y]^{\mathrm{T}}$ is represented by the vector $\mathbf{b} = [x \; y \; u \; v]^{\mathrm{T}}$, where $\theta = \arctan(v / u)$ and $s = \sqrt{u^2 + v^2}$. The transformation of the position vector $\mathbf{b}$, written $T(\mathbf{b})$, is obtained by pre-multiplying $\mathbf{t}$ with a matrix derived from $\mathbf{b}$:

$$
\mathbf{b}' = T(\mathbf{b}) =
\begin{bmatrix}
1 & 0 & x & -y \\
0 & 1 & y & x \\
0 & 0 & u & -v \\
0 & 0 & v & u
\end{bmatrix}
\begin{bmatrix} t_x \\ t_y \\ t_u \\ t_v \end{bmatrix}
$$

This formulation allows a transformation to be estimated easily from a set of feature correspondences. Given a pair of image and model features with positions $\mathbf{b}$ and $\mathbf{b}'$, the transformation aligning the two features can be obtained as the solution to the set of four linear equations given by $\mathbf{b}' = T(\mathbf{b})$. With additional feature pairs the problem of estimating the transformation becomes over-constrained, and a solution that is optimal in the least-squares sense can be found as described in section 3.4.3.
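As an illustration of the linear formulation, the following sketch (hypothetical code, not taken from the proposal) builds a position vector from location, orientation, and scale, and applies a transformation vector t to it.

```python
# Minimal sketch of the linear similarity-transform formulation described above.
import numpy as np

def position_vector(x, y, theta, scale):
    """Encode location, orientation, and scale as b = [x, y, u, v]."""
    return np.array([x, y, scale * np.cos(theta), scale * np.sin(theta)])

def transform_position(t, b):
    """Apply a viewpoint transformation t = [tx, ty, tu, tv] to a position b = [x, y, u, v]."""
    x, y, u, v = b
    A = np.array([[1.0, 0.0, x, -y],
                  [0.0, 1.0, y,  x],
                  [0.0, 0.0, u, -v],
                  [0.0, 0.0, v,  u]])
    return A @ t                              # b' = T(b), linear in t

# Example: rotate by 30 degrees, scale by 2, translate by (5, 3).
theta, r = np.radians(30.0), 2.0
t = np.array([5.0, 3.0, r * np.cos(theta), r * np.sin(theta)])
b = position_vector(1.0, 0.0, 0.0, 1.0)
print(transform_position(t, b))               # position of the transformed feature
```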

    3.4 Recognition procedure

To identify an instance of an object in an image the system must find a match between a model graph and an image graph. That match will likely be a partial one, with some image features not explained by the model (perhaps because there are other objects in the scene) and some model features not found in the image (perhaps because of shadows or occlusion). However, the partial match should also be optimal in some sense, maximizing both the number of nodes matched and the quality or closeness of those matches. Those two factors are evaluated by a match quality measure, which is described below in section 3.4.1. The measure relies on estimates of probability density, some of which are supplied by the estimator described in section 3.4.2. In section 3.4.3, the procedure for finding a match between a model graph and an image graph is described. Once found, a match is verified as described in section 3.4.4. To simplify the presentation, model graph nodes and image graph nodes are often referred to simply as model and image features.

    3.4.1 Match quality measure

    The matching procedure is guided by a match quality measure that considers several factors, including

    which features match, how similar their attribute values are, how closely their positions correspond, and


how uncommon the features are. Each factor is weighted according to past matching experience as recorded by the model. The measure combines empirical statistics using the rules of Bayesian probability theory to judge the likelihood that a given pairing of model features with image features corresponds to an actual instance of the object in the image.

A matching between image and model features is described by a sequence of tuples, $E = \{e_i\}$. Each tuple is either of the form $\langle j, k \rangle$, meaning that model feature $j$ matches image feature $k$, or of the form $\langle j, \emptyset \rangle$, meaning that model feature $j$ matches nothing. A complete matching is one that contains at least one tuple for each model feature.

Let $H$ denote the hypothesis that an object represented by model graph $G$ is present in some image. We seek a match $E$ and a viewpoint transformation $T$ that maximize $P(H \mid E, T)$, which can be written

$$
P(H \mid E, T) = \frac{P(E \wedge T \mid H)}{P(E \wedge T)}\,P(H) = \frac{P(E \mid T, H)\,P(T \mid H)}{P(E \wedge T)}\,P(H). \tag{3.1}
$$

There is no practical way to represent in full generality the high-dimensional, joint probability density functions $P(E \mid T, H)$ and $P(E \wedge T)$. Instead, we must approximate them using products of low-dimensional, marginal probability density functions. The approximation allows us to rewrite equation 3.1 as

$$
P(H \mid E, T) \approx \left[\prod_i \frac{P(e_i \mid T, H)}{P(e_i)}\right] \frac{P(T \mid H)}{P(T)}\,P(H). \tag{3.2}
$$

The approximation would be entirely accurate if the following two independence properties were to hold:

(a) $\{e_i\}$ is collectively independent given knowledge of $T$ and $H$;
(b) $\{e_i, T\}$ is collectively independent in the absence of any knowledge of $H$.

However, in practice we can expect these independence properties to hold only approximately. Given that an object is present at a particular pose, features detected at widely separated locations on the object may be largely independent, thus preserving property (a). And in a random scene cluttered with unknown objects, even nearby features may be largely independent, thus preserving property (b). The properties fail to hold, however, whenever there is redundancy among features. For example, a collection of nearby features is not independent of some additional feature that represents a noteworthy grouping of them; in such cases, equation 3.2 will overrate the significance of matching all of those features. Nevertheless, with redundancy present throughout a model graph, this bias may have a negligible effect on the outcome of any particular


matching problem. Our research strategy is pragmatic: we have adopted equation 3.2 as a first approximation, and we will replace it with a more careful approximation if that proves necessary.

Equation 3.2, then, forms the basis of our match quality measure. We assume that all viewpoint transformations are equally likely a priori, and thus $P(T)$ is a constant. $P(H)$ is the prior probability that an object as depicted by $G$ is present in an image; it can be estimated from the proportion of training images that matched $G$ and were used to create $G$. $P(T \mid H)$ can be taken as identical to $P(T)$, or it can be estimated from past training images by keeping some record of the transformations obtained when matching those images.

The remaining terms of equation 3.2 are the prior and conditional probabilities of various feature matches. Before defining those terms, we must introduce notation for some of the specific events within the universe of matching outcomes:

$e_j = k$ is the event that model feature $j$ matches image feature $k$;
$e_j = \emptyset$ is the event that model feature $j$ matches nothing;
$a_j = a$ is the event that model feature $j$ matches a feature whose attributes are $a$;
$b_j = b$ is the event that model feature $j$ matches a feature at position $b$ (in model coordinates).

    Conditional probability of a feature match

For the conditional probabilities, $P(e_i \mid T, H)$, there are two cases to consider.

(a) $P(e_j = k \mid T, H)$ is the probability of model feature $j$ matching image feature $k$ in an image of $G$ with viewpoint transformation $T$. Our estimate considers how often $j$ has matched during training, and what attributes and positions its matching features have had:

$$
P(e_j = k \mid T, H) = P(e_j \neq \emptyset \wedge a_j = a_k \wedge b_j = T(b_k) \mid T, H)
= P(a_j = a_k \mid b_j = T(b_k), e_j \neq \emptyset, T, H)\; P(b_j = T(b_k) \mid e_j \neq \emptyset, T, H)\; P(e_j \neq \emptyset \mid T, H).
$$

By adopting some additional independence assumptions (e.g., that $b_j$ gives no information about $a_j$), we obtain

$$
P(e_j = k \mid T, H) = P(a_j = a_k \mid e_j \neq \emptyset, H)\; P(b_j = T(b_k) \mid e_j \neq \emptyset, T, H)\; P(e_j \neq \emptyset \mid H).
$$

The $P(a_j = \cdot)$ and $P(b_j = \cdot)$ terms are estimated using the sequences of attribute vectors and positions recorded with model feature $j$. The estimate of $P(a_j = \cdot)$ uses the non-parametric density


estimator described in the next section, while the estimate of $P(b_j = \cdot)$ assumes a normal distribution:

$$
P(b_j = b \mid e_j \neq \emptyset, T, H) = G_4(b - y_j, C_j),
\qquad\text{where}\qquad
G_d(x, C) = \frac{1}{(2\pi)^{d/2} (\det C)^{1/2}}\, e^{-\frac{1}{2} x^{\mathrm{T}} C^{-1} x}
$$

and $y_j$ and $C_j$ are the mean vector and covariance matrix for $B_j$, the sequence of positions recorded with model feature $j$.

Finally, $P(e_j \neq \emptyset \mid H)$ is computed according to the frequency with which model feature $j$ was matched among training images. We use a Bayesian estimator with uniform prior:

$$
P(e_j \neq \emptyset \mid H) = (m_j + 1) / (m + 2). \tag{3.3}
$$

(Recall that $m$ is the number of training images that have been incorporated into $G$, and $m_j$ is the number of those in which model feature $j$ matched some image feature.)

(b) $P(e_j = \emptyset \mid T, H)$ is the probability of model feature $j$ remaining unmatched in an image of $G$ with viewpoint transformation $T$. It is just $1 - P(e_j \neq \emptyset \mid T, H)$, where $P(e_j \neq \emptyset \mid T, H) = P(e_j \neq \emptyset \mid H)$ under an independence assumption stated earlier, and $P(e_j \neq \emptyset \mid H)$ was defined by equation 3.3.
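Two of the factors above reduce to short computations: the Gaussian position density $G_d$ and the uniform-prior frequency estimate of equation 3.3. The sketch below is an assumed rendering of those two pieces only; the attribute-density factor would come from the non-parametric estimator of section 3.4.2, and all names are illustrative.

```python
# Sketch of two factors in the conditional match probability (hypothetical names).
import numpy as np

def gaussian_density(x, mean, cov):
    """G_d(x - mean, cov): the multivariate normal density used for feature positions."""
    d = len(mean)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def prob_feature_observed(m_j, m):
    """Equation 3.3: uniform-prior Bayesian estimate of the feature matching something."""
    return (m_j + 1) / (m + 2)

# Example: a feature matched in 7 of 10 training images, and a position density query.
print(prob_feature_observed(7, 10))                       # 8/12, about 0.667
mean = np.zeros(4)
cov = np.eye(4) * 4.0
print(gaussian_density([0.5, -0.2, 0.1, 0.0], mean, cov))
```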

    Prior probability of a feature match

Estimates of the prior probabilities, $P(e_i)$, are based on measurements of a large, random collection of images typical of those in which $G$ will be sought. In this milieu collection, $G$ occurs randomly along with other views and other objects. The milieu is used to estimate the probability of a feature match occurring regardless of whether $G$ is present. By an analysis similar to that underlying the estimates of the conditional probabilities, we obtain estimates for the two cases of $e_i$:

(a) $P(e_j = k) = P(a_j = a_k \mid e_j \neq \emptyset)\; P(b_j = b_k \mid e_j \neq \emptyset)\; P(e_j \neq \emptyset)$.  (3.4)
(b) $P(e_j = \emptyset) = 1 - P(e_j \neq \emptyset)$.

The $P(a_j = \cdot)$ term in equation 3.4 is estimated using a non-parametric density estimator and samples of attribute vectors drawn from the milieu collection. The $P(b_j = \cdot)$ term is estimated by assuming that feature positions are uniformly distributed throughout a bounded region of the model coordinate system. Finally, $P(e_j \neq \emptyset)$ is estimated from the frequency with which features of the same type as $j$ occur in the milieu collection.


    3.4.2 Probability density estimation

To evaluate the match quality measure we have to estimate the probability that a feature has a particular attribute vector. To help us make this estimate, we have samples of attribute vectors that have been acquired by observing the feature in training images. The estimation problem is therefore of the following form: given a sequence of $d$-dimensional vectors $\{x_i : 1 \le i \le n\}$ drawn at random from an unknown distribution, estimate the probability of a vector $x$ being drawn from the same distribution.

This could be solved by assuming that the distribution is of some parameterized form (e.g., normal) and using the $\{x_i\}$ to estimate those parameters (e.g., the mean and variance). However, the attribute vector distributions may be complex, since they depend not only on sensor noise and measurement errors but also on systematic variations in object shape, lighting, and pose. Therefore we have chosen a nonparametric estimation method described in detail by Silverman (1986). In its simplest form, this method estimates probability density5 by summing contributions from a series of overlapping kernels:

$$
\hat{f}(x) = \frac{1}{n h^d} \sum_i K\!\left(\frac{x - x_i}{h}\right), \tag{3.5}
$$

where $h$ is a constant smoothing factor and $K$ is a kernel function. We use the Epanechnikov kernel because it has finite support and can be computed quickly; it is defined as

$$
K(x) = \begin{cases} \tfrac{1}{2}\, c_d^{-1} (d + 2)\left(1 - x^{\mathrm{T}} x\right) & \text{if } x^{\mathrm{T}} x < 1 \\ 0 & \text{otherwise,} \end{cases}
$$

where $c_d$ is the volume of a $d$-dimensional sphere of unit radius. The smoothing factor $h$ appearing in equation 3.5 strikes a balance between the smoothness of the estimated distribution and its fidelity to the observations, $\{x_i\}$. We adjust $h$ using a locally-adaptive method: with $\hat{f}$ as a first density estimator, we create a second estimator, $\hat{f}_a$, whose smoothing factor varies according to the first estimator's density estimate:

$$
\hat{f}_a(x) = \frac{1}{n h^d} \sum_i \lambda_i^{-d}\, K\!\left(\frac{x - x_i}{h \lambda_i}\right),
$$

5 Because the attribute values are continuous, the estimates are of probability densities rather than masses. However, the problem of integrating these densities to obtain masses is neatly avoided, since the match quality measure only requires a ratio of two mass estimates; we can compute this ratio as a ratio of two density estimates.


$$
\text{where}\quad \lambda_i = \left(\frac{\hat{f}(x_i)}{\gamma}\right)^{-1/2} \quad\text{and}\quad \gamma = \left(\prod_i \hat{f}(x_i)\right)^{1/n}.
$$

The various $\lambda_i$ incorporate the first density estimates at the points $x_i$, and $\gamma$ is a normalizing factor. This adaptive estimator smoothes more in low-density regions than in high-density ones. Thus a sparse outlier is thoroughly smoothed, while a central peak is accurately represented in the estimate. Figure 3.3 shows an example of an estimate produced by this estimator.
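The two-pass estimator just described can be sketched compactly; the code below follows Silverman's construction with the Epanechnikov kernel, but the smoothing factor, sample data, and function names are illustrative choices rather than the proposal's actual implementation.

```python
# Sketch of the locally-adaptive kernel density estimator (after Silverman 1986).
import numpy as np
from math import gamma, pi

def epanechnikov(u):
    """Multivariate Epanechnikov kernel evaluated at each row of u (shape (n, d))."""
    d = u.shape[1]
    c_d = pi ** (d / 2) / gamma(d / 2 + 1)      # volume of the d-dimensional unit sphere
    sq = np.sum(u * u, axis=1)
    return np.where(sq < 1.0, 0.5 / c_d * (d + 2) * (1.0 - sq), 0.0)

def fixed_estimate(x, samples, h):
    """First-pass estimate of f(x) with a single smoothing factor h (equation 3.5)."""
    n, d = samples.shape
    return float(np.sum(epanechnikov((x - samples) / h)) / (n * h ** d))

def adaptive_estimate(x, samples, h):
    """Second-pass, locally-adaptive estimate f_a(x)."""
    n, d = samples.shape
    pilot = np.array([fixed_estimate(s, samples, h) for s in samples])
    pilot = np.maximum(pilot, 1e-12)            # guard against zero pilot densities
    geo_mean = np.exp(np.mean(np.log(pilot)))   # gamma: geometric mean of pilot estimates
    lam = (pilot / geo_mean) ** -0.5            # local bandwidth factors lambda_i
    contrib = epanechnikov((x - samples) / (h * lam[:, None])) / lam ** d
    return float(np.sum(contrib) / (n * h ** d))

# Example with hypothetical two-dimensional junction-angle samples.
rng = np.random.default_rng(0)
samples = rng.normal([85.0, 85.0], 3.0, size=(50, 2))
print(adaptive_estimate(np.array([84.0, 86.0]), samples, h=2.0))
```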

    3.4.3 Matching procedure

In matching model and image graphs, we seek a match $E$ and a viewpoint transformation $T$ that maximize $P(H \mid E, T)$ as defined by equation 3.2. This problem does not appear to have a solution method that is both fast and exact. One could, for example, enumerate all possible $E$ and, for each one, directly solve for an optimal $T$. This approach is guaranteed to find the truly optimal solution, but because its complexity is roughly exponential in the number of model or image features, it is impractical. To shorten the search for a solution considerably, we use two simplifications:

(a) The influence of $P(T \mid H)$ is ignored during the search. Instead, all viewpoint transformations are considered to be equally likely.
(b) We use an iterative alignment method that alternates between extending $E$ and refining $T$.

[Figure 3.3 appears here: a surface plot of estimated probability density over two attributes, the angles of junction 1 and junction 2 (roughly 75 to 95 degrees each), with spikes marking the sample points.]

Figure 3.3. Example of a locally-adaptive probability density estimate for attribute vectors. The spikes denote the samples from which the density function was estimated.


The iterative alignment method was used by Lowe (1987a, b) in his SCERPO system. It works as follows. A minimal set of feature matches is first hypothesized and used to estimate a viewpoint transformation. This transformation is then used to predict the positions of the remaining model features. For each of these projected model features, potential matches with nearby image features are identified and ranked according to likelihood (how well the two features match) and ambiguity (whether alternative matches appear nearly as good). The best-ranked matches are then adopted and used to refine the estimate of the viewpoint transformation. The process is repeated until acceptable matches have been found for as many of the model features as possible. Although backtracking can be used to try alternative matches for projected features, Lowe reports that this is seldom necessary, because the ranking scheme is usually successful in eliminating ambiguous matches, and because errors in the initial estimate of the viewpoint transformation are eliminated as additional matches are incorporated.

In adopting the two simplifications listed above, we are no longer assured that the match solution found will be a truly optimal one. However, except in rare circumstances (as when $P(T \mid H)$ is especially non-uniform), we expect to find a solution that is very close to optimal; Lowe has found that the iterative alignment method is accurate enough for reliable recognition.
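For clarity, the control flow of the iterative alignment method can be sketched as below. The callables passed in (estimate_transform, rank_candidates) stand for the operations described in the surrounding text; they are placeholders, not real components of the system.

```python
# Schematic sketch of the iterative alignment loop (placeholder callables, illustrative only).
def iterative_alignment(initial_matches, model_features, estimate_transform,
                        rank_candidates, max_iterations=10):
    """Grow a match E and refine a transformation T by alternating the two steps."""
    matches = list(initial_matches)              # E: adopted (model, image) feature pairs
    transform = estimate_transform(matches)      # initial T from the minimal hypothesis
    for _ in range(max_iterations):
        matched_model = {m for m, _ in matches}
        unmatched = [j for j in model_features if j not in matched_model]
        # Rank potential matches for the projected model features; keep only those
        # that are well matched and unambiguous.
        new_matches = rank_candidates(unmatched, matches, transform)
        if not new_matches:
            break                                # nothing acceptable remains
        matches.extend(new_matches)
        transform = estimate_transform(matches)  # refine T with the enlarged match set
    return matches, transform
```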

    Ranking feature matches

An initial estimate of the viewpoint transformation can be obtained once one or two feature matches have been hypothesized. However, possibilities for these initial feature matches must be evaluated without regard to a viewpoint transformation or the positions of the features. We use the following function to evaluate a potential match between model feature $j$ and image feature $k$, using the terms developed in section 3.4.1:

$$
g_j(k) = \frac{P(e_j = k \mid H)}{P(e_j = k)} = \frac{P(a_j = a_k \mid e_j \neq \emptyset, H)}{P(a_j = a_k \mid e_j \neq \emptyset)} \cdot \frac{P(e_j \neq \emptyset \mid H)}{P(e_j \neq \emptyset)}.
$$
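The ratio $g_j(k)$ translates directly into code. In the sketch below the attribute densities are assumed to come from estimators like those of section 3.4.2, and all names and example values are hypothetical.

```python
# Sketch of the initial (transformation-free) ranking function g_j(k).
def rank_initial_match(attr_density_given_H, attr_density_milieu,
                       m_j, m, p_type_in_milieu):
    """Evidence ratio for matching model feature j to image feature k before any
    viewpoint transformation is available."""
    p_observed_given_H = (m_j + 1) / (m + 2)          # equation 3.3
    attribute_ratio = attr_density_given_H / attr_density_milieu
    frequency_ratio = p_observed_given_H / p_type_in_milieu
    return attribute_ratio * frequency_ratio

# Example with made-up density estimates and counts.
print(rank_initial_match(attr_density_given_H=0.08, attr_density_milieu=0.01,
                         m_j=7, m=10, p_type_in_milieu=0.2))
```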

Subsequent feature match possibilities are evaluated according to the partial correspondence $E$ and estimated viewpoint transformation $T$ obtained from earlier matches. The function for evaluating these potential matches is

$$
g_j(k; E, T) = \begin{cases} \dfrac{P(e_j = k \mid T, H)}{P(e_j = k)} & \text{if } \langle j, k \rangle \text{ is consistent with } E \\ 0 & \text{otherwise.} \end{cases}
$$


For a feature match to receive consideration, it must be consistent with all of the matches adopted up to that point. In general this means that the feature must not already have been matched, and that any relations shared with features that are already matched must be consistent. However, to cope with some common shortcomings in feature detection, these restrictions are relaxed for certain types of features. For example, a model's line segment feature is permitted to match multiple, collinear image segments, so that a failure to detect a few edge pixels will be of minimal consequence.

    Estimating the viewpoint transformation

During the match search, we need to estimate a viewpoint transformation from a set of feature matches. In our formulation, the optimal viewpoint transformation consistent with a set of feature matches can be found as the solution to a weighted least-squares estimation problem. We find a transformation vector $\mathbf{t}$ that minimizes $\lVert W (A\mathbf{t} - \mathbf{b}) \rVert^2$, where the matrix $A$ represents the positions of the matched image features in image coordinates, the vector $\mathbf{b}$ represents the positions of the matched model features in model coordinates, and the matrix $W$ contains weights based on the covariances of model feature position coordinates. These objects are defined by

$$
A = \begin{bmatrix}
1 & 0 & x_1 & -y_1 \\
0 & 1 & y_1 & x_1 \\
0 & 0 & u_1 & -v_1 \\
0 & 0 & v_1 & u_1 \\
& & \vdots & \\
1 & 0 & x_n & -y_n \\
0 & 1 & y_n & x_n \\
0 & 0 & u_n & -v_n \\
0 & 0 & v_n & u_n
\end{bmatrix},
\qquad
\mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_n \end{bmatrix},
\qquad\text{and}\qquad
C = W^{\mathrm{T}} W = \begin{bmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_n \end{bmatrix}^{-1},
$$

where $\Sigma_i$ is the full $4 \times 4$ covariance matrix for the sequence of four-element position vectors, $B_j$, recorded with the model feature participating in the $i$th match, $e_i = \langle j, k \rangle$. (When, early in the learning process, $B_j$ contains only a single position vector, $\Sigma_i$ is set to some reasonable default.) The form of $C$ accords with our assumption that feature positions are independently distributed with respect to the object. The solution to the least-squares estimation problem, $\mathbf{t}$, and its covariance $P$ are given by

$$
\left(A^{\mathrm{T}} C A\right) \mathbf{t} = A^{\mathrm{T}} C \mathbf{b} \qquad\text{and}\qquad P = \left(A^{\mathrm{T}} C A\right)^{-1}.
$$


During the match search, a recursive least-squares estimation procedure can be used to update $\mathbf{t}$ at minimal cost each time a new feature match is added to $E$.
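The normal equations above amount to a few lines of linear algebra. The following sketch batches the matched positions into $A$, $\mathbf{b}$, and a block-diagonal $C$ and solves for $\mathbf{t}$ and $P$; it is an illustration under the stated assumptions, not the proposal's code, and the example data are made up.

```python
# Sketch of the weighted least-squares estimate of the viewpoint transformation.
import numpy as np

def estimate_transformation(image_positions, model_positions, position_covariances):
    """Solve (A^T C A) t = A^T C b for the transformation t; also return its covariance P.

    image_positions:      list of image-feature position vectors [x, y, u, v]
    model_positions:      list of matching model-feature position vectors (model coordinates)
    position_covariances: list of 4x4 covariance matrices, one per matched model feature
    """
    A_blocks, weight_blocks = [], []
    for (x, y, u, v), cov in zip(image_positions, position_covariances):
        A_blocks.append(np.array([[1.0, 0.0, x, -y],
                                  [0.0, 1.0, y,  x],
                                  [0.0, 0.0, u, -v],
                                  [0.0, 0.0, v,  u]]))
        weight_blocks.append(np.linalg.inv(cov))       # weight each match by inverse covariance
    A = np.vstack(A_blocks)
    b = np.concatenate([np.asarray(p, dtype=float) for p in model_positions])
    C = np.zeros((A.shape[0], A.shape[0]))
    for i, Wi in enumerate(weight_blocks):              # assemble the block-diagonal C
        C[4 * i:4 * i + 4, 4 * i:4 * i + 4] = Wi
    ATC = A.T @ C
    P = np.linalg.inv(ATC @ A)                          # covariance of the estimate
    t = P @ (ATC @ b)
    return t, P

# Example: two matches related by a 90-degree rotation and a translation of (1, 1).
img = [(0.0, 0.0, 1.0, 0.0), (2.0, 0.0, 1.0, 0.0)]
mdl = [(1.0, 1.0, 0.0, 1.0), (1.0, 3.0, 0.0, 1.0)]
t, P = estimate_transformation(img, mdl, [np.eye(4), np.eye(4)])
print(np.round(t, 3))                                   # approximately [1, 1, 0, 1]
```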

    3.4.4 Verification procedure

Once an optimal match has been found between a model graph and an image graph, it must be decided whether the match represents an instance of the modeled object in the image. At this time it is difficult to say how sophisticated this verification procedure must be if it is to be reliable enough for learning and recognition. Our proposal is to begin with a simple approach and elaborate it only as necessary. That simple approach, and some of the ways in which it may be enhanced, are outlined in this section.

    The match quality measure used to guide matching provides an estimate of the probability that the match

    result is correct. A simple way to verify the match, then, is to require that this probability exceeds some

    threshold whose value is a system parameter. The parameter essentially determines the degree of caution

    the system is to exercise in declaring recognition results; it can be set to the ratio of the costs of Type II and

    Type I decision errors to produce decisions that minimize the expected error cost.

However, the match quality measure may prove unsuitable for making verification decisions because it may differ too widely among different objects according to what high-level features they have. As explained in section 3.4.1, high-level features that represent groupings of low-level ones violate the feature independence assumption. Consequently, the match quality measure is biased by an amount that depends on what high-level features are present in the model. Whereas this bias may have little effect on the outcome of matching any one model graph, it could make it difficult to compare matches among different model graphs. If this is the case, it may be better to consider only the lowest-level features, edgels, when using the measure as a probability estimate for a verification decision.

To recognize more than one object in a scene we must accept more than one match for a given image graph. This will entail choosing among competing matches that provide alternative interpretations for some of the same image features. However, care must be used when determining whether two matches' interpretations conflict over a feature: different objects may be permitted to share an edge segment, for example, but not a feature denoting a closed group of edge segments. Once it has been determined which pairs of match interpretations can coexist, we can obtain a complete interpretation of the scene by selecting a subset of matches that is pairwise consistent and that maximizes the overall match quality measure. An iterative, greedy algorithm should be adequate for choosing that subset in all but the most elaborate and confusing scenes.
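One simple form of the greedy selection mentioned above is sketched here: candidate matches are considered in order of decreasing quality and kept only if they do not conflict with anything already accepted. The conflict test is a placeholder for the feature-sharing rules discussed in the text.

```python
# Illustrative greedy selection of pairwise-consistent match interpretations.
def select_interpretations(matches, conflicts):
    """Pick a high-quality, pairwise-consistent subset of candidate matches.

    matches:   list of (quality, match_id) pairs
    conflicts: conflicts(a, b) -> True if matches a and b claim the same image
               feature in an incompatible way
    """
    accepted = []
    for quality, match_id in sorted(matches, reverse=True):
        if all(not conflicts(match_id, other) for _, other in accepted):
            accepted.append((quality, match_id))
    return accepted

# Example: match 'b' conflicts with 'a', so only 'a' and 'c' survive.
candidates = [(0.9, 'a'), (0.7, 'b'), (0.5, 'c')]
conflict_pairs = {frozenset({'a', 'b'})}
print(select_interpretations(candidates,
                             lambda x, y: frozenset({x, y}) in conflict_pairs))
```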


    3.5 Model learning procedure

With each new training image, a model's description is updated by a model learning procedure. This procedure comprises three activities:

• clustering divides training images among clusters corresponding to characteristic views
• generalizing combines the training images of a single cluster to form a model graph
• consolidating simplifies a model graph while retaining essential information

When a new training image is received, the learning procedure must decide whether to assign it to one of the existing clusters or to form a new cluster with it. That decision is made by a clustering algorithm, which is described below in section 3.5.1. When a training image is assigned to an existing cluster, the generalizing algorithm described in section 3.5.2 merges information from the image into the model graph representing the cluster. Additionally, as a model graph grows in complexity, it is consolidated as described in section 3.5.3 to maintain efficiency.

    3.5.1 Clustering training images into characteristic views

An incremental conceptual clustering algorithm is used to create clusters among the training images as they are acquired. Like COBWEB (Fisher 1987), for example, the algorithm requires a measure of global clustering quality, which it uses to guide local clustering decisions. We would like that measure to promote two somewhat conflicting qualities while striking a balance between them. On one hand, it should favour clusterings that result in simple, concise, and efficient models; on the other hand, it should favour clusterings whose models accurately characterize (or fit) the training data.

Maximum a posteriori (MAP) estimation provides a nice framework for combining these two qualities. Given training data $\mathcal{X} = \{X\}$ for an object, our problem is to choose a model $M$ that best characterizes $\mathcal{X}$. (In our case, choosing $M$ amounts to choosing a clustering.) We choose an $M$ that maximizes $P(M \mid \mathcal{X})$, which, by Bayes's theorem, also maximizes the product $P(M)\,P(\mathcal{X} \mid M)$. The terms $P(M)$ and $P(\mathcal{X} \mid M)$ correspond to the simplicity and fit qualities mentioned above.

    Model simplicity term

The term $P(M)$ represents, in our case, a prior distribution on clusterings. To favour clusterings that result in simple models, we use the minimum description length principle (Rissanen 1983) to construct this prior distribution. We define an encoding scheme for models that amounts to concisely enumerating the model


graphs, nodes, arcs, attribute vectors, and position vectors with finite precision, using a fixed number of bits to encode each component. Then, with $L(M)$ denoting the length of $M$'s encoding, we use the relationship established by Shannon between probability distributions and minimal-length codings to obtain $-\log P(M) = L(M)$, thus defining $P(M)$.
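As a toy illustration of this coding-length prior (with entirely arbitrary bit budgets, since the proposal does not fix them), $L(M)$ and the corresponding $\log P(M)$ might be computed as follows.

```python
# Toy illustration of a description-length prior over clusterings.
# The per-component bit costs are arbitrary placeholders, not values from the proposal.
BITS_PER_NODE = 16
BITS_PER_ARC = 8
BITS_PER_VECTOR_COMPONENT = 8

def description_length(model_graphs):
    """L(M): total bits needed to encode the clustering's model graphs."""
    bits = 0
    for g in model_graphs:
        bits += BITS_PER_NODE * g["num_nodes"]
        bits += BITS_PER_ARC * g["num_arcs"]
        bits += BITS_PER_VECTOR_COMPONENT * g["num_vector_components"]
    return bits

def log_prior(model_graphs):
    """log2 P(M) = -L(M), by the Shannon code-length relationship."""
    return -description_length(model_graphs)

# Example: a model with two characteristic views.
views = [{"num_nodes": 40, "num_arcs": 55, "num_vector_components": 320},
         {"num_nodes": 25, "num_arcs": 30, "num_vector_components": 180}]
print(description_length(views), log_prior(views))
```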

    Data fit term

The term $P(\mathcal{X} \mid M)$ measures the fit between model and training data.6 To evaluate it we could estimate the probability of each of the model's training images according to the best match between image graph and model graph, which is found by searching over all characteristic views and feature matchings:

$$
P(\mathcal{X} \mid M) = \prod_{X \in \mathcal{X}} P(X \mid M),
\qquad\text{where}\qquad
P(X \mid M) = \max_{G \in M,\, E,\, T} P(H \mid E, T), \tag{3.6}
$$

$P(H \mid E, T)$ is defined by equation 3.2, and the maximum is taken over model graphs $G$, matchings $E$ between $G$ and $X$, and transformations $T$. However, in the course of learning the model, each training image has already been assigned to a particular characteristic view and matched to the model graph representing that view. By assuming that the maximum in equation 3.6 is still attained at these same choices of characteristic view and matching, we can avoid any search and instead estimate $P(\mathcal{X} \mid M)$ with a simple product taken over the model graphs, their nodes, and the sequences of attribute vectors and positions recorded by those nodes.

    Clustering procedure

For simplicity, each training image is assigned to exactly one cluster. (A procedure capable of assigning an image to more than one cluster might require less training data in some cases, but it must solve a problem of much greater complexity.) Clustering decisions are guided by a desire to maximize $P(M \mid \mathcal{X})$. The most straightforward clustering strategy assigns each training image to a cluster as it is acquired: $P(M \mid \mathcal{X})$ is estimated for each possible assignment, and the assignment yielding the largest value is adopted.
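A skeletal version of this straightforward strategy is sketched below. The score_clustering callable stands in for the estimate of $P(M \mid \mathcal{X})$, and the toy score used in the example is purely illustrative.

```python
# Skeleton of the straightforward incremental clustering strategy (illustrative names).
def assign_to_cluster(clusters, new_image, score_clustering):
    """Try the new training image in each existing cluster and in a new cluster of its own;
    keep whichever assignment maximizes the clustering score (an estimate of P(M | X))."""
    candidates = []
    for i in range(len(clusters)):
        trial = [list(c) for c in clusters]
        trial[i].append(new_image)
        candidates.append((score_clustering(trial), trial))
    trial = [list(c) for c in clusters] + [[new_image]]     # option: start a new cluster
    candidates.append((score_clustering(trial), trial))
    return max(candidates, key=lambda pair: pair[0])[1]

# Example with a toy score that prefers few, tight clusters of nearby values.
def toy_score(clustering):
    spread = sum(max(c) - min(c) for c in clustering)
    return -spread - 2.0 * len(clustering)

print(assign_to_cluster([[1.0, 1.2], [5.0]], 1.1, toy_score))   # [[1.0, 1.2, 1.1], [5.0]]
```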

6 A requirement for a good model is that there be limited covariation among features within each characteristic view (section 3.2.2). It can be shown that maximizing $P(\mathcal{X} \mid M)$ favours such models.


However, this simple strategy may be sensitive to the order in which training examples are presented. The sensitivity may be alleviated by adjusting cluster boundaries, merging clusters, and splitting them in the course of learning. Experimental evaluation is needed to determine to what extent order sensitivity is a problem, and which strategies are effective in overcoming it. Possible strategies include:

• periodically re-assign the training image, $X$, that has the lowest $P(X \mid M)$
• with each new training image, consider merging the two clusters best matched by that image
• to identify possible ways of splitting a cluster, apply a numeric clustering algorithm to the collections of attribute vectors recorded for certain model features

    3.5.2 Generalizing training images to form a model graph

Within each cluster, training images are merged to form a single model graph that represents a generalization of those images. An initial model graph is formed from the first training image assigned to a cluster by selecting, from its image graph, the connected subgraph containing the largest-scale token. When a subsequent training image is assigned to the same cluster, an optimal match is found between its image graph and the current model graph. Merging is based on that optimal match. Each model node that matches an image node is updated by adding the image node's attribute vector and position to the sequences recorded with the model node. New nodes are added to the model graph for those image nodes that were not matched by the model.
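In outline, the generalization step might look like the following sketch, in which model nodes are plain dictionaries and the coordinate mapping is passed in; the names and structure are assumptions, not the proposal's implementation.

```python
# Outline of merging one training image into a cluster's model graph (illustrative).
def merge_image_into_model(model, image_features, match, to_model_coords):
    """Fold one matched training image into the model graph.

    model:           {"m": int, "nodes": [ {"type", "m_j", "A_j", "B_j"}, ... ]}
    image_features:  list of (feature_type, attribute_vector, position) tuples
    match:           dict mapping image-feature index -> model-node index, or None
    to_model_coords: maps an image position into model coordinates
    """
    model["m"] += 1
    for img_idx, (ftype, attrs, pos) in enumerate(image_features):
        mdl_idx = match.get(img_idx)
        if mdl_idx is not None:                  # matched: extend the node's samples
            node = model["nodes"][mdl_idx]
            node["m_j"] += 1
            node["A_j"].append(attrs)
            node["B_j"].append(to_model_coords(pos))
        else:                                    # unmatched: add a new model node
            model["nodes"].append({"type": ftype, "m_j": 1,
                                   "A_j": [attrs], "B_j": [to_model_coords(pos)]})

# Example: one matched feature and one new feature, with an identity coordinate mapping.
model = {"m": 1, "nodes": [{"type": "junction", "m_j": 1,
                            "A_j": [[80.0]], "B_j": [[0, 0, 1, 0]]}]}
image = [("junction", [82.0], [0.1, 0.2, 1, 0]),
         ("ribbon", [5.0, 2.0], [3, 4, 1, 0])]
merge_image_into_model(model, image, {0: 0, 1: None}, lambda p: p)
print(model["m"], len(model["nodes"]), model["nodes"][0]["m_j"])    # 2 2 2
```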

    3.5.3 Consolidating a model graph

As information accumulates during learning, it may reveal ways in which a model may be simplified to achieve greater efficiency. The system periodically attempts the following simplifications for each evolving model graph:

• Rare features are eliminated. The model is pruned of features whose likelihood of being detected, $P(e_j \neq \emptyset \mid H)$, has dropped below some threshold.

• Attribute vector distributions are consolidated. The time and space complexity of the non-parametric density estimators used to approximate attribute vector distributions increase linearly with the number of training samples. Thus, to improve efficiency, a distribution represented by a large number of samples is replaced with a parametric estimator, provided that the parametric estimator approximates the sampled distribution well enough. A goodness-of-fit test determines when the substitution can be made.
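The consolidation pass could, for instance, prune rare features using equation 3.3 and switch an attribute distribution to a Gaussian form once a goodness-of-fit check passes. The sketch below uses a crude moment-based check and arbitrary thresholds purely for illustration; it is not the test the proposal would ultimately adopt.

```python
# Illustrative consolidation pass: prune rare features and, when a crude normality
# check passes, replace raw attribute samples with (mean, covariance) parameters.
import numpy as np

def consolidate(nodes, m, prune_threshold=0.1, min_samples=30, kurtosis_tol=1.0):
    kept = []
    for node in nodes:
        p_observed = (node["m_j"] + 1) / (m + 2)        # equation 3.3
        if p_observed < prune_threshold:
            continue                                    # rare feature: drop it
        samples = np.asarray(node["A_j"], dtype=float)
        if samples.shape[0] >= min_samples:
            z = (samples - samples.mean(0)) / (samples.std(0) + 1e-12)
            excess_kurtosis = np.abs((z ** 4).mean(0) - 3.0)
            if np.all(excess_kurtosis < kurtosis_tol):  # crude normality check
                node["gaussian"] = (samples.mean(0), np.cov(samples.T))
                node["A_j"] = None                      # raw samples no longer needed
        kept.append(node)
    return kept

# Example: the second, rarely-matched feature is pruned.
rng = np.random.default_rng(1)
nodes = [{"m_j": 8, "A_j": rng.normal(size=(40, 2)).tolist()},
         {"m_j": 0, "A_j": [[1.0, 2.0]]}]
print(len(consolidate(nodes, m=10)))    # 1
```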


    3.6 Summary

To handle variation in object appearance, this approach relies on finding an abundance of features for describing that appearance. In practice this should work because most objects have many local features that can be used to identify them. An image is described in terms of features chosen for stability and significance; by retaining some redundancy among those features, we obtain a collection whose stability exceeds that of its individual members.

The model represents some information about how an object has been observed to vary in appearance. Because this information is recorded independently for each feature, the matching and learning procedures remain simple and efficient. However, for the s