
From Computer Vision Systems, A Hanson & E. Riseman (Eds.), pp. 3-26, New York: Academic Press (1978).

COMPUTER VISION SYSTEMS

RECOVERING INTRINSIC SCENE CHARACTERISTICS FROM IMAGES

H.G. Barrow and J.M. Tenenbaum, SRI International,

Menlo Park, CA 94025.

ABSTRACT

We suggest that an appropriate role of early visual processing is to describe a scene in terms of intrinsic (veridical) characteristics -- such as range, orientation, reflectance, and incident illumination -- of the surface element visible at each point in the image. Support for this idea comes from three sources: the obvious utility of intrinsic characteristics for higher-level scene analysis; the apparent ability of humans to determine these characteristics, regardless of viewing conditions or familiarity with the scene; and a theoretical argument that such a description is obtainable, by a noncognitive and nonpurposive process, at least, for simple scene domains. The central problem in recovering intrinsic scene characteristics is that the information is confounded in the original light-intensity image: a single intensity value encodes all the characteristics of the corresponding scene point. Recovery depends on exploiting constraints, derived from assumptions about the nature of the scene and the physics of the imaging process.

I INTRODUCTION

Despite considerable progress in recent years, our understanding of the principles underlying visual perception remains primitive. Attempts to construct computer models for the interpretation of arbitrary scenes have resulted in such poor performance, limited range of abilities, and inflexibility that, were it not for the human existence proof, we might have been tempted long ago to conclude that high-performance, general-purpose vision is impossible. On the other hand, attempts to unravel the mystery of human vision have resulted in a limited understanding of the elementary neurophysiology, and a wealth of phenomenological observations of the total system, but not, as yet, in a cohesive model of how the system functions. The time is right for those in both fields to take a broader view: those in computer vision might do well to look harder at the phenomenology of human vision for clues that might indicate fundamental inadequacies of current approaches; those concerned with human vision might gain insights by thinking more about what information is sought, and how it might be obtained, from a computational point of view. This position has been strongly advocated for some time by Horn [18-20] and Marr [26-29] at MIT.

Current scene analysis systems often use pictorial features, such as regions of uniform intensity, or step changes in intensity, as an initial level of description and then jump directly to descriptions at the level of complete objects. The limitations of this approach are well known [4]: first, region-growing and edge-finding programs are unreliable in extracting the features that correspond to object surfaces because they have no basis for evaluating which intensity differences correspond to scene events significant at the level of objects (e.g., surface boundaries) and which do not (e.g., shadows). Second, matching pictorial features to a large number of object models is difficult and potentially combinatorially explosive because the feature descriptions are impoverished and lack invariance to viewing conditions. Finally, such systems cannot cope with objects for which they have no explicit model.

Some basic deficiencies in current approaches to machine vision are suggested when one examines the known behavior and competence of the human visual system. The literature abounds with examples of the ability of people to estimate characteristics intrinsic to the scene, such as color, orientation, distance, size, shape, or illumination, throughout a wide range of viewing conditions. Many experiments have been performed to determine the scope of so-called “shape constancy,” “size constancy,” and “color constancy” [13 and 14]. What is particularly remarkable is that consistent judgements can be made despite the fact that these characteristics interact strongly in determining intensities in the image. For example, reflectance can be estimated over an extraordinarily wide range of incident illumination: a black piece of paper in bright sunlight may reflect more light than a white piece in shadow, but they are still perceived as black and white respectively. Color also appears to remain constant throughout wide variation in the spectral composition of incident illumination. Variations in incident illumination are independently perceived: shadows are usually easily distinguished from changes in reflectance. Surface shape, too, is easily discerned regardless of illumination or surface markings: Yonas has experimentally determined that human accuracy in estimating local surface orientation is about eight degrees [37]. It is a worthwhile exercise at this point to pause and see how easily you can infer intrinsic characteristics, like color or surface orientation, in the world around you.

Copyright © 1978 by Academic Press, Inc.

All rights of reproduction in any form reserved.

ISBN 0-12-323550-2


The ability of humans to estimate intrinsic characteristics does not seem to require familiarity with the scene, or with objects contained therein. One can form descriptions of the surfaces in scenes unlike any previously seen, even when the presentation is as unnatural as a photograph. People can look at photomicrographs, abstract art, or satellite imagery, and make consistent judgements about relative distance, orientation, transparency, reflectance, and so forth. See, for example, Figure 1, from a thesis by Macleod [25].

Looking beyond the phenomenological aspects, one might ask what is the value of being able to estimate such intrinsic characteristics. Clearly, some information is valuable in its own right: for example, knowing the three-dimensional structure of the scene is fundamental to many activities, particularly to moving around and manipulating objects in the world. Since intrinsic characteristics give a more invariant and more distinguishing description of surfaces than raw light intensities, they greatly simplify many basic perceptual operations. Scenes can be partitioned into regions that correspond to smooth surfaces of uniform reflectance, and viewpoint-independent descriptions of the surfaces may then be formed [29]. Objects may be described and recognized in terms of collections of these elementary surfaces, with attributes that are characteristic of their composition or function, and relationships that convey structure, and not merely appearance. A chair, for example, can be described generically as a horizontal surface, at an appropriate height for sitting, and a vertical surface situated to provide back support. Previously unknown objects can be described in terms of invariant surface characteristics, and subsequently recognized from other viewpoints.

A concrete example of the usefulness of intrinsic scene information in computer vision can be obtained from experiments by Nitzan, Brain and Duda [30] with a laser rangefinder that directly measures distance and apparent reflectance. Figure 2a shows a test scene taken with a normal camera. Note the variation in intensity of the wall and chart due to variations in incident illumination, even though the light sources are extended and diffuse. The distance and reflectance images obtained by the rangefinder for this scene are shown in Figure 2b. The distance information is shown in a pictorial representation in which closer points appear brighter. Note that, except for a slight amount of crosstalk on the top of the cart, the distance image is insensitive to reflectance variations. The laser images are also entirely free from shadows.

Using the distance information, it is relatively straightforward to extract regions corresponding to flat or smooth surfaces, as in Fig. 2c, or edges corresponding to occlusion boundaries, as in Figure 2d, for example. Using reflectance information, conventional region- or edge-finding programs show considerable improvement in extracting uniformly painted surfaces. Even simple thresholding extracts acceptable surface approximations, as in Figure 2e.

Since we have three-dimensional information, matching is now facilitated. For example, given the intensity data of a planar surface that is not parallel to the image plane, we can eliminate the projective distortion in these data to obtain a normal view of this surface, Figure 2f. Recognition of the characters is thereby simplified. More generally, it is now possible to describe objects generically, as in the chair example above. Garvey [10] actually used generic descriptions at this level to locate objects in rangefinder images of office scenes.

The lesson to be learned from this example is that the use of intrinsic characteristics, rather than intensity values, alleviates many of the difficulties that plague current machine vision systems, and to which the human visual system is apparently largely immune.

The apparent ability of people to estimate intrinsic characteristics in unfamiliar scenes and the substantial advantages that such characteristics would provide strongly suggest that a visual system, whether for an animal or a machine, should be organized around an initial level of domain-independent processing, the purpose of which is the recovery of intrinsic scene characteristics from image intensities. The next step in pursuing this idea is to examine in detail the computational nature of the recovery process to determine whether such a design is really feasible.

In this paper, we will first establish the true nature of the recovery problem, and demonstrate that recovery is indeed possible, up to a point, in a simple world. We will then argue that the approach can be extended, in a straightforward way, to more realistic scene domains. Finally, we will discuss this paradigm and its implications in the context of current understanding of machine and human vision. For important related work see [29].

II THE NATURE OF THE PROBLEM

The first thing we must do is specify precisely the objectives of the recovery process in terms of input and desired output.

The input is one or more images representing light intensity values, for different viewpoints and spectral bands. The output we desire is a family of images for each viewpoint. In each family there is one image for each intrinsic characteristic, all in registration with the corresponding input images. We call these images “Intrinsic Images.” We want each intrinsic image to contain, in addition to the value of the characteristic at each point, explicit indications of boundaries due to discontinuities in value or gradient. The intrinsic images in which we are primarily interested are of surface reflectance, distance or surface orientation, and incident illumination. Other characteristics, such as transparency, specularity, luminosity, and so forth, might also be useful as intrinsic images, either in their own right or as intermediate results.

Figure 3 gives an example of one possible set of intrinsic images corresponding to a single, monochrome image of a simple scene. The intrinsic images are here represented as line drawings, but in fact would contain numerical values at every point.


Figure 1 Photomicrographs of pollen grains (Macleod [25]): (a) Castanopsis (×3500), (b) Drimys (×3200), (c) Flax (×1000), (d) Wallflower (×1800)


Figure 2 Experiments with a laser rangefinder (panels include thresholded reflectance and a corrected view of the cart top)


Figure 3 A set of intrinsic images derived from a single monochrome intensity image: (a) original scene, (b) distance, (c) reflectance, (d) orientation (vector), (e) illumination. The images are depicted as line drawings, but, in fact, would contain values at every point. The solid lines in the intrinsic images represent discontinuities in the scene characteristic; the dashed lines represent discontinuities in its derivative.


The solid lines show discontinuities in the represented characteristic, and the dashed lines show discontinuities in its gradient. In the input image, intensities correspond to the reflected flux received from the visible points in the scene. The distance image gives the range along the line of sight from the center of projection to each visible point in the scene. The orientation image gives a vector representing the direction of the surface normal at each point. It is essentially the gradient of the distance image. The short lines in this image are intended to convey to the reader the surface orientation at a few sample points. (The distance and orientation images correspond to Marr's notion of a 2.5D sketch [29].) It is convenient to represent both distance and orientation explicitly, despite the redundancy, since some visual cues provide evidence concerning distance and others evidence concerning orientation. Moreover, each form of information may be required by some higher-level process in interpretation or action. The reflectance image gives the albedo (the ratio of total reflected to total incident illumination) at each point. Albedo completely describes the reflectance characteristics for lambertian (perfectly diffusing) surfaces, in a particular spectral band. Many surfaces are approximately lambertian over a range of viewing conditions. For other types of surface, reflectance depends on relative directions of incident rays, surface normal and reflected rays. The illumination image gives the total light flux incident at each point. In general, to completely describe the incident light it is necessary to give the incident flux as a function of direction. For point light sources, one image per source is sufficient, if we ignore secondary illumination by light scattered from nearby surfaces.

Figure 4 An ideally diffusing surface

When an image is formed, by a camera or by an eye, the light intensity at a point in the image is determined mainly by three factors at the corresponding point in the scene: the incident illumination, the local surface reflectance, and the local surface orientation. In the simple case of an ideally diffusing surface illuminated by a point source, as in Figure 4, for example, the image light intensity, L, is given by

L = I * R * cos i (1)

where I is the intensity of incident illumination, R is the reflectivity of the surface, and i is the angle of incidence of the illumination [20].
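As a concrete illustration, Equation (1) can be written directly as a few lines of code. This is only a sketch of the stated model; the numerical values in the example are arbitrary assumptions, not data from the paper.

```python
import math

def image_intensity(I, R, incidence_deg):
    """Equation (1): light intensity reflected by an ideally diffusing
    (lambertian) surface element lit by a point source."""
    return I * R * math.cos(math.radians(incidence_deg))

# Arbitrary illustrative values: a mid-grey surface lit at 60 degrees.
print(image_intensity(I=100.0, R=0.5, incidence_deg=60.0))  # ~25.0
```

Note that many different triples (I, R, i) yield the same value of L, which is exactly the confounding discussed next.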

The central problem in recovering intrinsic scene characteristics is that information is confounded in the light-intensity image: a single intensity value encodes all the intrinsic attributes of the corresponding scene point. While the encoding is deterministic and founded upon the physics of imaging, it is not unique: the measured light intensity at a single point could result from any of an infinitude of combinations of illumination, reflectance, and orientation.

We know that information in the intrinsic images completely determines the input image. The crucial question is whether the information in the input image is sufficient to recover the intrinsic images.

III THE NATURE OF THE SOLUTION

The only hope of decoding the confounded information is, apparently, to make assumptions about the world and to exploit the constraints they imply. In images of three-dimensional scenes, the intensity values are not independent but are constrained by various physical phenomena. Surfaces are continuous in space, and often have approximately uniform reflectance. Thus, distance and orientation are continuous, and reflectance is constant everywhere in the image, except at edges corresponding to surface boundaries. Incident illumination, also, usually varies smoothly. Step changes in intensity usually occur at shadow boundaries, or surface boundaries. Intrinsic surface characteristics are continuous through shadows. In man-made environments, straight edges frequently correspond to boundaries of planar surfaces, and ellipses to circles viewed obliquely. Many clues of this sort are well known to psychologists and artists. There are also higher-level constraints based on knowledge of specific objects, or classes of object, but we shall not concern ourselves with them here, since our aim is to determine how well images can be interpreted without object-level knowledge.

We contend that the constraints provided by such phenomena, in conjunction with the physics of imaging, should allow recovery of the intrinsic images from the input image. As an example, look carefully at a nearby painted wall. Observe that its intensity is not uniform, but varies smoothly. The variation could be due, in principle, to variations in reflectance, illumination, orientation, or any combination of them. Assumptions of continuity immediately rule out the situation of a smooth intensity variation arising from cancelling random variations in illumination, reflectance, and orientation. Since surfaces are assumed to be uniform in reflectance, the intensity variation must thus be due to a smooth variation in illumination or surface shape. The straight edge of the wall suggests, however, that the wall is planar, and that the variation is in illumination only. To appreciate the value of this constraint, view a small central portion of the wall through a tube. With no evidence from the edge, it is difficult to distinguish whether the observed shading is due to an illumination gradient on a planar surface, or to a smooth surface curving away from the light source.

The tube experiment shows that while isolated fragments of an image have inherent ambiguity, interactions among fragments resulting from assumed constraints can lead to a unique interpretation of the whole image. Of course, it is possible to construct (or occasionally to encounter) scenes in which the obvious assumptions are incorrect -- for example, an Ames room (see [13] for an illustration). In such cases, the image will be misinterpreted, resulting in an illusion. The Ames illusion is particularly interesting because it shows the lower-level interpretation, of distance and orientation, dominating the higher-level knowledge regarding relative sizes of familiar objects, and even dominating size constancy. Fortunately, in natural scenes, as commonly encountered, the evidence is usually overwhelmingly in favor of the correct interpretation.

We have now given the flavor of the solution, but with many of the details lacking. Our current research is aimed at making the underlying ideas sufficiently precise to implement a computational model. While we are far from ready to attack the full complexity of the real world, we can give a fairly precise description of such a model for recovering intrinsic characteristics in a limited world. Moreover, we can argue that this model may be extended incrementally to handle more realistic scenes.

IV SOLUTION FOR A SIMPLE WORLD

A. Methodology

To approach the problem systematically, we select an idealized domain in which a simplified physics holds exactly, and in which explicit constraints on the nature of surfaces and illuminants are satisfied. From these assumptions, it is possible to enumerate various types of scene fragments and determine the appearance of their corresponding image fragments. A catalog of fragment appearances and alternative interpretations can thus be compiled (in the style of Huffman [21] and Waltz [34]).

We proceed by constructing specific scenes that satisfy the domain assumptions, synthesizing corresponding images of them, and then attempting to recover intrinsic characteristics from the images, using the catalog and the domain knowledge. (By displaying synthetic images, we could check that people can interpret them adequately. If they cannot, we can discover oversimplifications by comparing the synthetic images to real images of similar scenes.)

B. Selection of a Domain

Specifications for an experimental domain must include explicit assumptions regarding the scene, the illumination, the viewpoint, the sensor, and the image-encoding process. The initial domain should be sufficiently simple to allow exhaustive enumeration of its constraints, and complete cataloging of appearances. It must, however, be sufficiently complex so that the recovery process is nontrivial and generalizable. A domain satisfying these requirements is defined as follows:

* Objects are relatively smooth, having surfaces over which distance and orientation are continuous. That is, there are no sharp edges or creases.

* Surfaces are lambertian reflectors, with constant albedo over them. That is, there are no surface markings and no visible texture.

* Illumination is from a distant point source, of known magnitude and direction, plus uniformly diffuse background light of known magnitude (an approximation to sun, sky, and scattered light). Local secondary illumination (light reflected from nearby objects) is assumed to be negligible. (See Figure 5.)

* The image is formed by central projection onto a planar surface. Only a single view is available (no stereo or motion parallax). The scene is viewed from a general position (incremental changes in viewpoint do not change the topology of the image).

* The sensor measures reflected flux density. Spatial and intensity resolution are sufficiently high that quantization effects may be ignored. Sensor noise is also negligible.

Such a domain might be viewed as an approximation of a world of colored Play-Doh objects in which surfaces are smooth, reflectance is uniform for each object, there is outdoor illumination, and the scene is imaged by a TV camera. The grossest approximations, perhaps, are the assumptions about illumination, but they are substantially more realistic than the usual single-point-source model, which renders all shadowed regions perfectly black.

For this domain, our objective is to recover intrinsic images of distance, orientation, reflectance, and illumination.

C. Describing the Image

Elementary physical considerations show that a portion of a surface that is continuous in visibility, distance, orientation, and incident illumination, and has uniform reflectance, maps to a connected region of continuous intensity in the image. Images thus consist of regions of smoothly varying intensity, bounded by step discontinuities. In our domain, reflectance is constant over each surface, and there are two states of illumination, corresponding to sun and shadow.


Figure 5 Sun and sky illumination model

Image regions therefore correspond to areas of surface with a particular state of illumination, and the boundaries correspond to occluding (extremal) boundaries of surfaces, or to the edges of shadows. There are also junctions where boundaries meet. Figure 6b shows the regions and edges for the simple scene of Figure 3.

To be quantitative, we assume image intensity is calibrated to give reflected flux density at the corresponding scene point. Reflected flux density is the product of integrated incident illumination, I, and reflectance (albedo), R, at a surface element. Thus,

L = I * R (2)

The reflected light is distributed uniformly over a hemisphere for a lambertian surface. Hence, image intensity is independent of viewing direction. It is also independent of viewing distance, because although the flux density received from a unit area of surface decreases as the inverse square of distance, the surface area corresponding to a unit area in the image increases as the square of distance.

In shadowed areas of our domain, where surface elements are illuminated by uniform diffuse illumination of total incident flux density I0, the image intensity is given by

L = I0 * R (3)

When a surface element is illuminated by a point source, such that the flux density is I1, from a direction specified by the unit vector, S, the incident flux density at the surface is I1 * N.S, where N is the unit normal to the surface, and . is the vector dot product. Thus,

L = I1 * N.S * R (4)

In directly illuminated areas of the scene, image intensity, L, is given by the sum of the diffuse and point-source components:

L = (I0 + I1 * N.S) * R (5)
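Equations (3) and (5) can be collected into a small sketch of the domain's photometric model; Equation (4) is simply the point-source term alone. The function below is our illustration, with assumed vector and flux-density values: it evaluates Equation (3) when a surface element faces away from the point source and Equation (5) when it is directly illuminated.

```python
import numpy as np

def reflected_intensity(R, I0, I1, N, S):
    """Image intensity for a lambertian surface of albedo R under diffuse
    illumination I0 plus a distant point source of flux density I1 arriving
    from unit direction S; N is the unit surface normal."""
    cos_i = float(np.dot(N, S))          # N.S, the cosine of the incidence angle
    if cos_i <= 0.0:                     # facing away from the source: Equation (3)
        return I0 * R
    return (I0 + I1 * cos_i) * R         # directly illuminated: Equation (5)

# Illustrative values only: a surface tilted 45 degrees away from the source.
N = np.array([0.0, 0.0, 1.0])
S = np.array([0.0, np.sqrt(0.5), np.sqrt(0.5)])
print(reflected_intensity(R=0.6, I0=20.0, I1=80.0, N=N, S=S))
```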

From the preceding sections, we are now in a position to describe the appearance of image fragments in our domain, and then to derive a catalog.

1. Regions

For a region corresponding to a directly illuminated portion of a surface, since R, I0, and I1 are constant, any variation in image intensity is due solely to variation in surface orientation. For a region corresponding to a shadowed area of surface, intensity is simply proportional to reflectance, and hence is constant over the surface.

We now catalog regions by their appearance. Regions can be classified initially according to whether their intensities are smoothly varying, or constant. In the former case, the region must correspond to a nonshadowed, curved surface with constant reflectance and continuous depth and orientation. In the latter case, it must correspond to a shadowed surface. (An illuminated planar surface also has constant intensity, but such surfaces are excluded from our domain.) The shadowing may be due either to a shadow cast upon it, or to its facing away from the point source. The shape of a shadowed region is indeterminable from photometric evidence. The surface may contain bumps or dents and may even contain discontinuities in orientation and depth across a self-occlusion, with no corresponding intensity variations in the image.
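In code, this initial classification amounts to the constancy test on a region's intensities. The sketch below assumes a region is given as a flat list of intensity samples and uses an arbitrary tolerance; it is illustrative only.

```python
def classify_region(intensities, tol=1e-6):
    """Constancy test from the region catalog above: constant intensity is
    interpreted as a shadowed surface; smoothly varying intensity as a
    directly illuminated, curved surface of constant reflectance."""
    if max(intensities) - min(intensities) <= tol:
        return "shadowed (constant intensity)"
    return "illuminated curved surface (varying intensity)"
```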

2. Edges

In the same fashion as for regions, we can describe and catalog region boundaries (edges). An edge should not be considered merely as a step change in image intensity, but rather as an indication of one of several distinct scene events. In our simple world, edges correspond either to the extremal boundary of a surface (the solid lines in Figure 3b), or to the boundary of a cast shadow (the solid lines in Figure 3e). The "terminator" line on a surface, where there is a smooth transition from full illumination to self-shadowing (the dashed lines in Figure 3e), does not produce a step change in intensity.

The boundary of a shadow cast on a surface indicates only a difference in incident illumination: the intrinsic characteristics of the surface are continuous across it. As we observed earlier, the shadowed region is constant in intensity, and the illuminated region has an intensity gradient that is a function of the surface orientation. The shadowed region is necessarily darker than the illuminated one.


Figure 6 Initial classification of edges in an example scene: (a) original scene, (b) input intensity image, (c) LA: constant, LB: constant, (d) LA: constant, LB: varying, (e) LA: constant, LB: tangent, (f) LA: varying, LB: tangent.


An extremal boundary is a local extremum of the surface from the observer's point of view, where the surface turns away from him. Here one surface occludes another, and all intrinsic characteristics may be discontinuous. In our world, it is assumed that depth and orientation are always discontinuous. Reflectance is constant on each side of the edge, and will only be continuous across it if the two surfaces concerned have identical reflectance, or when a single surface occludes itself.

A very important condition results from the fact that an extremal boundary indicates where a smooth surface turns away from the viewer: the orientation is normal to the line of sight, and to the tangent to the boundary in the image. Hence the absolute orientation of the occluding surface can be determined along an extremal boundary. The boundary must first be identified as extremal, however.

The regions on either side of an extremal boundary are independently illuminated. When both are in shadow, they both have constant intensity, and the ratio of intensities is equal to the ratio of the reflectances: it is not possible to determine from local intensity information which region corresponds to the occluding surface.

When a region is illuminated, intensity varies continuously along its side of the boundary. We noted that the image of an extremal boundary tells us precisely the orientation of the occluding surface at points along it. The orientation, together with the illumination flux densities, I0 and I1, and the image intensity, L, can be used in Equation (5) to determine reflectance at any point on the boundary. For a true extremal boundary, our assumption of uniform reflectance means that estimates of reflectance at all points along it must agree. This provides a basis for recognizing an occluding surface by testing whether reflectances derived at several boundary points are consistent. We call this test the tangency test, because it depends upon the surface being tangential to the line of sight at the boundary. This test is derived by differentiating the logarithm of Equation (5):

dL/L = dR/R + (I1 * dN.S)/(I0 + I1 * N.S) (6)

The vector dN is the derivative of surface orientation along the edge, and may be determined from the derivative of edge direction. The derivatives dL and dR are taken along the edge. Equation (6) may be rewritten to give dR/R explicitly in terms of L, dL, N, dN and the constants I0, I1, and S. The tangency condition is met when dR/R is zero. The tangency condition is a powerful constraint that can be exploited in further ways, which we will discuss later.
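The tangency test can be sketched as follows: at sample points along a candidate extremal boundary, the surface normal is taken to be the image-plane normal to the boundary (at an extremal boundary it has no component along the line of sight), reflectance is estimated from Equation (5), and the estimates are checked for mutual consistency, i.e., dR/R near zero along the edge. The boundary representation, sampling, tolerance, and the orientation convention for the normal are our assumptions for illustration.

```python
import numpy as np

def boundary_normal(p_prev, p_next):
    """Unit normal (in the image plane) to the boundary at a point, computed
    from its two neighboring boundary points; assumed oriented toward the
    occluding (tested) region."""
    t = np.asarray(p_next, float) - np.asarray(p_prev, float)
    n = np.array([-t[1], t[0]])
    return n / np.linalg.norm(n)

def tangency_test(points, L, I0, I1, S_img, tol=0.05):
    """Estimate R = L / (I0 + I1 * N.S) at interior boundary points and accept
    the boundary as extremal if the estimates agree to within `tol`.
    S_img is the projection of the unit source direction onto the image plane
    (the line-of-sight component of N is zero at an extremal boundary)."""
    estimates = []
    for k in range(1, len(points) - 1):
        N = boundary_normal(points[k - 1], points[k + 1])
        cos_i = max(float(np.dot(N, S_img)), 0.0)   # clip: self-shadowed side
        estimates.append(L[k] / (I0 + I1 * cos_i))
    spread = (max(estimates) - min(estimates)) / float(np.mean(estimates))
    return spread < tol, estimates
```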

Strictly speaking, where we have referred to derivatives here, we should have said "the limit of the derivative as the edge is approached from the side of the region being tested." Clearly, the tests are not applicable at gaps, and the tangency test is not applicable where the edge is discontinuous in direction.

We can now catalog edges by their appearances, as we did for regions. Edges are classified according to the appearance of the regions on either side in the vicinity of the edge. This is done by testing intensity values on each side for constancy, as before, and for satisfaction of the tangency test, and by testing relative intensities across the edge. Table 1 catalogs the possible appearances and interpretations of an edge between two regions, A and B.

Table 1 The Nature of Edges

"Varying" means that neither of these testssucceeds. The entry "EDGE" denotes adiscontinuity in the corresponding intrinsicattribute, with the same location and directionas the corresponding image intensity edge. Themagnitude and sense of the discontinuity areunknown, unless otherwise shown. Where thevalue of an intrinsic attribute can bedetermined from the image (see Section IV.D),it is indicated by a term of the form RA, RB,DA, etc. (These terms denote values forreflectance, R; orientation, vector N;distance, D; and incident flux density, I; forthe regions A and B, in the obvious way.) Whereonly a constraint on values is known, it isindicated by an inequality. There is a specialsituation, concerning case 1 of the second typeof edge, in which a value can be determined forNB.S, but not for NB itself.

Note that, from the types of intensity variations, edges can be interpreted unambiguously, except for two cases: namely, the sense of the occluding edge between two shadowed regions, and the interpretation of an edge between illuminated and shadowed regions when the tangency test fails. Figure 6 illustrates the classification of edges according to the catalog for our example scene.

3. Junctions

Since the objects in our domain are smooth, there are no distinguished points on surfaces. Junctions in the image, therefore, are viewpoint dependent. There are just two classes of junction, both resulting from an extremal boundary, and both appearing as a T-shape in the image (see Figure 7).

The first type of junction arises when one object partially occludes a more distant boundary, which may be either a shadow edge or an extremal edge. The crossbar of the junction is thus an extremal edge of an object that is either illuminated or shadowed. The boundary forming the stem lies on the occluded object and its edge type is unconstrained.

The second type of junction arises when a shadow cast on a surface continues around behind an extremal boundary. In this case, the crossbar is again an extremal edge, half in shadow, while the stem is a shadow edge lying on the occluding object.

Note that in both cases the crossbar of the T corresponds to a continuous extremal boundary. Hence the two edges forming the crossbar are continuous, occluding, and have the same sense.

The T junctions provide constraints that can sometimes resolve the ambiguities in the edge table above. Consider the cases as follows: if all the regions surrounding the T are shadowed, the edge table tells us that all the edges are occluding, but their senses are ambiguous. The region above the crossbar, however, must be occluding the other two; otherwise the continuity of the crossbar would be due to an accident of viewpoint. If one or more of the regions is illuminated, the occlusion sense of the crossbar is immediately determined from the tangency test. Thus we can always determine the nature of the two edges forming the crossbar of the T, even when they may have been ambiguous according to the edge table.

If the region above the crossbar is the occluder, we have the first type of T junction, and can say no more about the stem than the edge tests give us. Otherwise, we have the second type (a shadow edge cast over an extremal boundary), and any ambiguity of the stem is now resolved.

Figure 7 Two types of T-junction

D. Recovery Using the Catalog

The ideal goal is to recover all intrinsic scene characteristics, exactly, everywhere in an image that is consistent with our domain. In this section, we outline the principles of recovery using the catalog and address the issue of how nearly our goal can be attained. The following section will describe a detailed computational model of the recovery process.

The recovery process has four main steps:

(1) Find the step edges in the input intensity image.

(2) Interpret the intrinsic nature of the regions and edge elements, according to the catalog. Interpretation is based on the results of constancy and tangency tests.

(3) Assign initial values for intrinsic characteristics along the edges, based on the interpretations.

(4) Propagate these "boundary" values into the interiors of regions, using continuity assumptions. This step is analogous to the use of relaxation methods in physics for determining distributions of temperature or potential over a region based on boundary values.
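The four steps can be organized as in the toy sketch below, which handles only the reflectance and illumination images of a single intensity image; orientation and distance would be treated analogously. The thresholds, the trivial edge and shadow tests, the frontal-orientation default, and the wrap-around neighbor handling are simplifying assumptions made purely for illustration.

```python
import numpy as np

def recover_sketch(L, I0, I1, n_iterations=500):
    """Toy version of the four recovery steps for an intensity image L."""
    # (1) Find step edges: large local intensity differences.
    gy, gx = np.gradient(L)
    grad = np.hypot(gx, gy)
    edges = grad > 0.1 * (L.max() - L.min())

    # (2) Interpret regions with the constancy test: locally constant
    #     intensity is taken as shadow.
    shadowed = grad < 1e-6

    # (3) Initialize: default illumination and reflectance values.
    illumination = np.where(shadowed, I0, I0 + I1)
    reflectance = L / illumination

    # (4) Propagate by relaxation: smooth reflectance within regions
    #     (it should be piecewise constant), holding edge pixels fixed,
    #     and re-derive illumination from L = illumination * reflectance.
    for _ in range(n_iterations):
        avg = 0.25 * (np.roll(reflectance, 1, 0) + np.roll(reflectance, -1, 0) +
                      np.roll(reflectance, 1, 1) + np.roll(reflectance, -1, 1))
        reflectance = np.where(edges, reflectance, avg)
        reflectance = np.clip(reflectance, 1e-6, 1.0)
        illumination = L / reflectance

    return reflectance, illumination, edges
```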

For pedagogical purposes, in this section only, we assume that it is possible to extract a perfect line drawing from the intensity image. The recovery process described later does not depend on this assumption, because it perfects the line drawing as an integral part of its interpretation. Let us now consider the ultimate limitations of the recovery paradigm in our simple domain.

Shadowed and directly illuminated areas of the image are distinguished immediately, using the constancy test. Reflectance everywhere in shadowed areas is then given by Equation (3).

The orientation of a region corresponding to an illuminated surface can be determined along extremal boundaries identified by the tangency test. Reflectance of this region can then be determined at the boundary by Equation (5), and thus throughout the region based on the assumption that reflectance is constant over a surface.

So far, recovery has been exact; the intrinsic values and edges that can be exactly inferred from intensity edges are shown in Table 1. Surface orientation within illuminated regions bounded, at least in part, by extremal edges can be reasonably estimated, as follows: Equation (5) can be solved, knowing L, R, I0, and I1, for N.S, the cosine of the angle of incidence of the direct component of illumination, at each point. This does not uniquely determine the orientation, which has two degrees of freedom, but it does allow reasonable estimates to be obtained using the assumption of smoothness of surfaces and the known orientation at extremal boundaries to constrain the other degree of freedom. Two examples of how this reconstruction can be done are given by the work of Horn [19] and Woodham [35] on "shape from shading."

Orientation can be integrated to obtain relative distance within the regions, and the tangency test gives distance ordering across the boundary.

Since a shadowed region of surface appears uniform in the intensity image, its shape cannot be determined from shading information. A plausible guess can be made, however, by interpolating in from points of known orientation, using the smoothness assumption. This can be done only if at least part of the boundary of the shadowed region can be interpreted as an extremal boundary (e.g., using T-junctions), or as a shadow edge with the shape on the illuminated side known.

Not surprisingly, little can be said about regions, shadowed or illuminated, with no visible portions of boundary identifiable as extremal (e.g., a region seen through a hole, or shadowed, with no T-junctions). It is still reasonable to attempt to interpret such inherently ambiguous situations, but it is then necessary to introduce further, and perhaps less general, assumptions. For example: an object is predominantly convex, so the sense of an occlusion can be guessed locally from the shape of the boundary; the brightest point on an illuminated surface is probably oriented with its normal pointing at the light source, providing a boundary condition for determining the surface reflectance and its shape from shading. Of course, such assumptions must be subordinate to harder evidence, when it is available.

We conclude that, in this limited domain, unambiguous recovery of intrinsic characteristics at every point in an image is not generally possible, primarily because of the lack of information in some regions of the intensity image. Thus, in some cases, we must be content with plausible estimates derived from assumptions about likely scene characteristics. When these assumptions are incorrect, the estimates will be wrong, in the sense that they will not correspond exactly to the scene; they will, however, provide an interpretation that is consistent with the evidence available in the intensity image, and most of the time this interpretation will be largely correct.

Though perfect recovery is unattainable, it is remarkable how much can be done considering the weakness (and hence generality) of the assumptions, and the limited number of cues available in this domain. We used no shape prototypes, nor object models, and made no use of any primary depth cues, such as stereopsis, motion parallax, or texture gradient. Any of these sources, if available, could be incorporated to improve performance by providing information where previously it could only be guessed (for example, texture gradient could eliminate shape ambiguity in shadows).

V A COMPUTATIONAL MODEL

We now propose a detailed computational model of the recovery process. The model operates directly on the data in a set of intrinsic images and uses parallel local operations that modify values in the images to make them consistent with the input image and constraints representing the physical assumptions about imaging and the world.

A. Establishing Constraints

Recovery begins with the detection of edges in the intensity image. If quantization and noise are assumed negligible in the domain, we can easily distinguish all step discontinuities, and hence generate a binary image of intensity edges, each with an associated direction. This image will resemble a perfect line drawing, but despite the ideal conditions, there can still be gaps where intensities on two sides of a boundary happen to be identical. Usually this will occur only at a single isolated point -- for example, at a point where the varying intensities on the two sides of an occlusion boundary simultaneously pass through the same value. Complete sections of a boundary may also occasionally be invisible -- for example, when a shadowed body occludes itself. Our recovery process is intended to cope with these imperfections, as will be seen later.

Given the edge image, the next step is to interpret the intrinsic nature of the edge elements according to the edge table. Interpretation is based on the results of two tests, constancy and tangency, applied to the intensities of the regions immediately adjacent to an edge element. The constancy test is applied by simply checking whether the gradient of intensity is zero. The tangency test is applied in its differential form by checking whether the derivative of estimated reflectance, taken along the edge, is zero.

The resulting edge interpretations are used to initialize values and edges in the intrinsic images in accordance with the edge table. The table specifies, for each type of intensity edge, the intrinsic images in which corresponding edge elements should be inserted. Values are assigned to intrinsic image points adjacent to edges, as indicated in the table. For example, if an intensity edge is interpreted as an occlusion, we can generate an edge in the distance and orientation images, and initialize orientation and reflectance images at points on the occluding side of the boundary.

When the edge interpretation is ambiguous, we make conservative initializations, and wait for subsequent processing to resolve the issue. In the case of an extremal boundary separating two shadowed regions, this means assigning a discontinuity in distance, orientation and reflectance, but not assuming anything else about orientation or relative distance. In the case of ambiguity between a shadow edge and a shadowed occluding surface, we assume discontinuities in all characteristics, and that the illumination and reflectance of the shadowed region are known. It is better to assume the possible existence of an edge, when unsure, because it merely decouples the two regions locally; they may still be related through a chain of constraints along some other route.

For points at which intrinsic values have not been uniquely specified, initial values are assigned that are reasonable on statistical and psychological grounds. In the orientation image, the initial orientation is assigned as N0, the orientation pointing directly at the viewer. In the illumination image, areas of shadow, indicated by the constancy test, are assigned value I0. The remaining directly illuminated points are assigned value I1*N0.S. In shadowed areas, reflectance values are assigned as L/I0, and in illuminated areas, they are assigned L/(I0 + I1*N0.S). Distance values are more arbitrary, and we assign a uniform distance, D0, everywhere. The choice of default values is not critical; they simply serve as estimates to be improved by the constraint satisfaction processes.
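A sketch of this default initialization, assuming NumPy arrays, a viewer-facing default normal N0 = (0, 0, 1), and that the illumination image holds the total incident flux (the diffuse component plus the point-source component); the array layout and function name are illustrative assumptions.

```python
import numpy as np

def initialize_defaults(L, shadowed, I0, I1, S, D0=1.0):
    """Default intrinsic values for points not fixed by edge interpretation.
    L is the intensity image; `shadowed` is the boolean mask produced by the
    constancy test; S is the unit direction of the point source (length-3)."""
    h, w = L.shape
    N0 = np.array([0.0, 0.0, 1.0])           # default orientation: toward viewer
    cos0 = float(np.dot(N0, S))              # N0.S

    orientation = np.tile(N0, (h, w, 1))                       # N0 everywhere
    illumination = np.where(shadowed, I0, I0 + I1 * cos0)      # I0, or diffuse plus direct
    reflectance = np.where(shadowed, L / I0, L / (I0 + I1 * cos0))
    distance = np.full((h, w), D0)                             # uniform default distance D0
    return reflectance, orientation, distance, illumination
```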

Following initialization, the next step is to establish consistency of intrinsic values and edges in the images. Consistency within an individual image is governed by simple continuity and limit constraints. In the reflectance image, the constraint is that reflectance is constant -- that is, its gradient must be defined and be zero everywhere, except at a reflectance edge. Reflectance is additionally constrained to take values between 0 and 1. Orientation values are also constrained to be continuous, except at occlusion edges. The vectors must be unit vectors, with a positive component in the direction of the viewpoint. Illumination is positive and continuous, except across shadow boundaries. In shadowed regions, it must be constant, and equal to I0. Distance values must be continuous everywhere -- that is, their gradient must be defined and finite, except across occlusion edges. Where the sense of the occlusion is known, the sense of the discontinuity is constrained appropriately. Distance values must always be positive.

All these constraints involve local neighborhoods, and can thus be implemented via asynchronous parallel processes. The continuity constraints, in particular, might be implemented by processes that simply ensure that the value of a characteristic at a point is the average of neighboring values. Such processes are essentially solving Laplace's equation by relaxation.
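A minimal sketch of such a relaxation process: each free value is repeatedly replaced by the average of its four neighbors, while explicitly determined values act as fixed boundary conditions. The synchronous (Jacobi-style) update and the wrap-around neighbor handling are simplifications of the asynchronous local processes described above.

```python
import numpy as np

def relax(values, fixed, n_iterations=1000):
    """Relaxation solution of Laplace's equation: `values` is a 2-D array,
    and points where `fixed` is True (boundary values, edge points) are
    never changed; all other points are replaced by the average of their
    four neighbors on each iteration."""
    v = values.astype(float).copy()
    for _ in range(n_iterations):
        avg = 0.25 * (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                      np.roll(v, 1, 1) + np.roll(v, -1, 1))
        v = np.where(fixed, v, avg)
    return v
```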

The value at a point in an intrinsic image is related not only to neighboring values in the same image, but also to values at the corresponding point in the other images. The primary constraint of this sort is that image intensity is everywhere the product of illumination and reflectance, as in Equation (2). Incident illumination is itself a sum of terms, one for each source. This may conveniently be represented by introducing secondary intrinsic images for the individual sources. The image representing diffuse illumination is constant, with value I0, while that for the point source is I1*N.S, where N.S is positive and the surface receives direct illumination, and zero elsewhere. These constraints tie together values at corresponding points in the input intensity, reflectance, primary and secondary illumination, and orientation images. The orientation and distance images are constrained together by the operation of differentiation.

B. Achieving Consistency

Consistency is attained when the intra- and inter-image constraints are simultaneously satisfied. That occurs when values and gradients everywhere are consistent with continuity assumptions and constraints appropriate to the type of image, and the presence or absence of edges. The intraimage constraints ensure that intrinsic characteristics vary smoothly, in the appropriate ways. Such constraints are implicit in the domain description, and are not usually made so explicit in machine vision systems. The inter-image constraints ensure that the characteristics are also consistent with the physics of imaging and the input intensity image. It is these constraints that permit an estimate of surface shape from the shading within regions of smoothly varying intensity [19].

Consistency is achieved by allowing the local constraint processes, operating in parallel, to adjust intrinsic values. The explicitly determined values listed in the initialization table, however, are treated as boundary conditions and are not allowed to change. As a result, information propagates into the interior of regions from the known boundaries.

The initial assignment of intrinsic edges is, as we have already noted, imperfect: edges found in the intensity image may contain gaps, introducing corresponding gaps in the intrinsic edges; certain intrinsic edges are not visible in the intensity image -- for example, self-occlusion in shadow; some intrinsic cases were assumed in the interests of conservatism, but they may be incorrect. From the recovered intrinsic values, it may be clear where further edges should be inserted in (or deleted from) the corresponding intrinsic image. For example, when the gradient in an image becomes very high, insertion of an edge element is indicated, and, conversely, when the difference in values across an edge becomes very small, deletion is indicated. Insertion and deletion must operate within existing constraints. In particular, edge elements cannot be manipulated within a single image in isolation, because a legal interpretation must always be maintained. For example, in our world, occlusion implies edges in distance and orientation simultaneously. Within an intrinsic image, continuity of surfaces implies continuity of boundaries. Hence, the decision to insert must take neighboring edge elements into consideration.

Constraints and boundary conditions are dependent upon the presence or absence of edges. For example, if a new extremal edge is inserted, continuity of distance is no longer required at that point, and the orientation on one side is absolutely determined. Consequently, when edge elements are inserted or deleted, the constraint satisfaction problem is altered. The constraint and edge modification processes run continuously, interacting to perfect the original interpretation and recover the intrinsic characteristics. Figures 8 and 9 summarize the overall organization of images and constraints.

So far we have not mentioned the role of junctions in the recovery process. At this point, it is unclear whether junctions need to be treated explicitly since the edge constraints near a confluence of edges will restrict relative values and interpretations. Junctions could also be used in an explicit fashion. When a T-configuration is detected, either during initialization or subsequently, the component edges could be interpreted via a junction catalog, which would provide more specific interpretations than the edge table. Detection of junctions could be performed in parallel by local processes, consistent with the general organization of the system. The combinatorics of junction detection are much worse than those of edge detection, and have the consequence that reliability of detection is also worse. For these reasons, it is to be hoped that explicit reliance upon junctions will not be necessary.

The general structure of the system is clear, but a number of details remain to be worked out. These include: how to correctly represent and integrate inter- and intra-image constraints; how to insert and delete edge points; how to correctly exploit junction constraints; how to ensure rapid convergence and stability. Although we do not yet have a computer implementation of the complete model, we have been encouraged by experiments with some of the key components.

We have implemented a simple scheme that uses a smoothness constraint to estimate surface orientation in region interiors from boundary values. The constraint is applied by local parallel processes that replace each orientation vector by the average of its neighbors. The surface reconstructed is a quadratic function of the image coordinates. It is smooth, but not uniformly curved, with its boundary lying in a plane. It appears possible to derive more complex continuity constraints that result in more uniformly curved surfaces, interpolating, for example, spherical and cylindrical surfaces from their silhouettes.
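A minimal rendering of the simple averaging scheme just described, under our own naming and with a Jacobi-style update (the paper gives no code), might look as follows; known boundary values are held fixed while interior orientations are repeatedly replaced by the average of their neighbors:

import numpy as np

def relax_orientation(normals, known_mask, iterations=200):
    """normals    : H x W x 3 array of unit surface-normal estimates.
    known_mask : H x W boolean array, True where values were explicitly
                 determined at initialization (treated as fixed boundary
                 conditions).  Interior values relax toward the average of
                 their four neighbors, so information propagates inward
                 from the known boundaries."""
    n = normals.copy()
    for _ in range(iterations):
        avg = (np.roll(n, 1, 0) + np.roll(n, -1, 0) +
               np.roll(n, 1, 1) + np.roll(n, -1, 1)) / 4.0
        avg /= np.linalg.norm(avg, axis=2, keepdims=True) + 1e-12  # keep unit length
        n = np.where(known_mask[..., None], n, avg)
    return n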

The above smoothing process was augmented with another process that simultaneously adjusts the component of orientation in the direction of steepest intensity gradient to be consistent with observed intensity. The result is a cooperative "shape from shading" algorithm, somewhat different from Woodham's [35]. The combined algorithm has the potential for reconstructing plausible surface shape in both shadowed and directly illuminated regions.
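The following fragment is a heavily hedged illustration of such an adjustment step (the Lambertian reflectance map, the brute-force search, and all names are our own assumptions; it is not the algorithm used in our experiments). At each pixel the surface slope is shifted along the direction of steepest intensity gradient until the predicted brightness best matches the observed intensity:

import numpy as np

def photometric_adjust(p, q, image, p_s, q_s):
    """p, q       : arrays of surface slopes dz/dx, dz/dy.
    image      : observed intensities, assumed scaled to [0, 1].
    (p_s, q_s) : assumed light-source direction in gradient space."""
    def reflectance(pp, qq):  # Lambertian reflectance map (an assumption here)
        return ((1.0 + pp * p_s + qq * q_s) /
                (np.sqrt(1.0 + pp**2 + qq**2) * np.sqrt(1.0 + p_s**2 + q_s**2)))
    gy, gx = np.gradient(image)
    gmag = np.hypot(gx, gy) + 1e-8
    ux, uy = gx / gmag, gy / gmag          # unit vector along intensity gradient
    best_err = np.full(image.shape, np.inf)
    best_t = np.zeros_like(image)
    for t in np.linspace(-0.5, 0.5, 21):   # small search along the gradient direction
        err = np.abs(reflectance(p + t * ux, q + t * uy) - image)
        better = err < best_err
        best_err = np.where(better, err, best_err)
        best_t = np.where(better, t, best_t)
    return p + best_t * ux, q + best_t * uy

Interleaving such an update with the smoothing step above gives the flavor of a cooperative shape-from-shading iteration.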

VI EXTENDING THE THEORY

Our initial domain was deliberately oversimplified, partly for pedagogic purposes and partly to permit a reasonably exhaustive analysis. The approach of thoroughly describing each type of scene event and its appearance in the image, and then inverting to form a catalog of interpretations for each type of image event, is more generally applicable. Results so far appear to indicate that while ambiguities increase with increasing domain complexity, available constraints also increase proportionately. Information needed to effect recovery of scene characteristics seems to be available; it is mainly a matter of thinking and looking hard enough for it.

In this section, we will briefly describe some of the ways in which the restrictive assumptions of our initial domain can be relaxed, and the recovery process correspondingly augmented, to bring us closer to the real world.

The assumption of continuous, noise-free encoding is important for avoiding preoccupation with details of implementation, but it is essential for a realistic theory to avoid reliance upon it. With these assumptions, problems of edge detection are minimized, but, as we noted earlier, perfect line drawings are not produced. Line drawings conventionally correspond to surface outlines, which may not be visible everywhere in the image. The recovery process we described, therefore, incorporated machinery for inserting and deleting edges to achieve surface integrity. Relaxing the assumption of ideal encoding will result in failure to detect some weak intensity edges and possibly the introduction of spurious edge points in areas of high intensity gradient. Insofar as these degradations are moderate, the edge-refinement process should ensure that the solution is not significantly affected.

The assumption of constant reflectance on a surface can be relaxed by introducing a new edge type -- the reflectance edge -- where orientation, distance, and illumination are still continuous, but reflectance undergoes a step discontinuity. Reflectance edges bound markings on a surface, such as painted letters or stripes. The features distinguishing the appearance of an illuminated reflectance edge are that the ratio of intensities across the edge is constant along it and equal to the ratio of the reflectances, that the magnitude of the intensity gradient across the edge is also in the ratio of the reflectances, and that its direction is the same on both sides.


Figure 8 A parallel computational model for recovering intrinsic images

The basic model consists of a stack of registered arrays, representing the original intensity image (top) and the primary intrinsic arrays. Processing is initialized by detecting intensity edges in the original image, interpreting them according to the catalog, and then creating the appropriate edges in the intrinsic images (as implied by the downward sweeping arrows).

Parallel local operations (shown as circles) modify the values in each intrinsic image to make them consistent with the intra-image continuity and limit constraints. Simultaneously, a second set of processes (shown as vertical lines) operates to make the values consistent with inter-image photometric constraints. A third set of processes (shown as Xs) operates to insert and delete edge elements, which locally inhibit continuity constraints. The constraint and edge modification processes operate continuously and interact to recover accurate intrinsic scene characteristics and to perfect the initial edge interpretation.


Figure 9 Organization of the computational model

The input intensity image and primary intrinsic images each have an associated gradient image and an edge image. Additional images represent intermediate results of computation, such as N·S. (All images are shown as hexagons.) The constraints, shown here as circles, are of three varieties: physical limits (e.g., 0 ≤ R ≤ 1), local continuity, and inter-image consistency of values and edges. Continuity constraints are inhibited across edges. For example, values of illumination gradient are constrained to be continuous, except at illumination edges.

These characteristics uniquely identify illuminated reflectance edges, and provide constraints relating intrinsic characteristics on the two sides.
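A toy version of the ratio test (the tolerance and the function name are ours) might be written as:

import numpy as np

def looks_like_reflectance_edge(side_a, side_b, tol=0.1):
    """side_a, side_b : intensities sampled at matching positions along the
    two sides of a hypothesized edge.  For an illuminated reflectance edge
    the ratio should be essentially constant along the edge, and its mean
    estimates the ratio of the two reflectances."""
    ratios = np.asarray(side_a, float) / np.asarray(side_b, float)
    spread = (ratios.max() - ratios.min()) / ratios.mean()
    return spread < tol, float(ratios.mean())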

In shadow, it is not possible to locally distinguish a pure reflectance edge from an extremal edge between surfaces of different reflectance, for which the ratio of intensities is also equal to the ratio of reflectances.

The existence of reflectance edges introduces a new type of junction, the X-shaped junction, where a shadow edge is cast across a reflectance edge. The detection of an X-junction in the image unambiguously identifies a shadow edge, since the reflectance edge may be easily identified by the ratio test.

An interesting case of surface markings is that of reflectance texture. Texture has certain regular or statistical properties that can be exploited to estimate relative distance and surface orientation. If we can assume statistically uniform density of particular textural features, the apparent density in the image relates distance and the inclination of the surface normal to the viewing direction. A second cue is provided by the orientation of textural features (for example, reflectance edge elements). As the surface is viewed more obliquely, the orientation distribution of image features tends to cluster about the orientation of the extremal boundary, or of the horizon. These cues are important, since they provide independent information about surface shape, perhaps less precise and of lower resolution than photometric information, but in areas where photometric information is unavailable (e.g., shadowed regions) or ambiguous (e.g., an illuminated region seen through a hole).
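As a toy illustration of the density cue (our own formulation, assuming idealized foreshortening at fixed distance): if the same uniformly textured surface is seen at the same distance once nearly head-on and once obliquely, its image feature density grows roughly as 1/cos(sigma), where sigma is the angle between the surface normal and the line of sight, so the two measured densities give a crude slant estimate:

import numpy as np

def slant_from_density(density_frontal, density_oblique):
    """Crude slant estimate from texture-feature densities measured in two
    equal-sized image windows on the same surface at the same distance."""
    cos_sigma = np.clip(density_frontal / density_oblique, 0.0, 1.0)
    return np.degrees(np.arccos(cos_sigma))

# e.g., 40 features per window head-on versus 80 per window in the oblique
# view suggests a slant of about 60 degrees.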

The assumption of smoothness of surfaces can be relaxed by introducing a further edge type, the intersection edge, which represents a discontinuity in surface orientation, such as occurs at a crease or between the faces of a polyhedron. There are two distinct ways an intersection edge can appear in an image, corresponding to whether one or both of the intersecting surfaces are visible. We shall call these subcases "occluding" and "connecting," respectively.

At a connecting intersection edge, only distance is necessarily continuous, since the faces can be differently painted, and illuminated by different sources. The strong assumption of continuity of orientation is replaced by the weaker one that the local direction of the surface edge in three dimensions is normal to the orientations of both surfaces forming it. The effect of this constraint is that if one surface orientation is known, the surface edge direction can be inferred from the image, and the other surface orientation is then hinged about that edge, leaving it one degree of freedom. Even when neither orientation is known absolutely, the existence of a connecting edge serves to inhibit application of continuity constraints, and thereby permits more accurate reconstruction of surface shape.
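The one-degree-of-freedom "hinge" can be made explicit with a small sketch (the parameterization and names are ours): both face normals must be perpendicular to the 3-D edge direction, so once one normal is known, the other is obtained by rotating it about the edge through an unknown hinge angle:

import numpy as np

def hinged_normals(n1, edge_dir, angles_deg):
    """Return candidate normals for the second face of a connecting edge,
    one per hinge angle, given the known normal n1 of the first face and
    the 3-D direction of the edge."""
    e = np.asarray(edge_dir, float)
    e /= np.linalg.norm(e)
    n1 = np.asarray(n1, float)
    n1 = n1 - np.dot(n1, e) * e            # enforce n1 perpendicular to the edge
    n1 /= np.linalg.norm(n1)
    b = np.cross(e, n1)                    # completes an orthonormal frame about e
    return np.array([np.cos(a) * n1 + np.sin(a) * b
                     for a in np.radians(angles_deg)])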

At an occluding intersection edge, nothing is known to be continuous, and the only constraint is on relative distance.


In the image, an illuminated intersection edge can be distinguished from an extremal edge since the intensity on both sides is varying but the tangency test fails, and it can be distinguished from a reflectance edge since the ratio of intensities across the edge is not constant. The constraint between the surface orientations forming the edge makes it appear likely that a test can also be devised for distinguishing between connecting and occluding intersection edges.

In shadowed regions, intersection edges are only visible when they coincide with reflectance edges, from which they are therefore locally indistinguishable. Creases in a surface are thus invisible in shadows.

When one surface is illuminated and the other shadowed, an intersection edge cannot be locally distinguished from the case of a shadowed object occluding an illuminated one.

Extremal and intersection boundaries together give a great deal of information about surface shape, even in the absence of other evidence, such as shading, or familiarity with the object. Consider, for example, the ability of an artist to convey abstract shape with line drawings. The local inclination of an extremal or intersection boundary to the line of sight is, however, unknown; a given silhouette can be produced in many ways [27]. In the absence of other constraints, the distance continuity constraint will ensure smooth variation of distance at points along the boundary. An additional constraint that could be invoked is to assume that the torsion (and possibly also the curvature) of the boundary space curve is minimal. This will tend to produce planar space curves for boundaries, interpreting a straight line in the image as a straight line in space, or an ellipse in the image as a circle in space. The assumption of planarity is often very reasonable: it is the condition used by Marr to interpret silhouettes as generalized cylinders [27].

The assumption of known illumination can be relaxed in various ways. Suppose we have the same "sun and sky" model of light sources, but do not have prior knowledge of I0, I1, and S. In general, we cannot determine the flux densities absolutely, since they trade off against reflectance. We may, however, assign an arbitrary reflectance to one surface (for example, assume the brightest region is white, R = 1.0), and then determine other reflectances and the flux densities relative to it. The initial assignment may need to be changed if it forces the reflectance of any region to exceed the limits, 0.0 < R < 1.0.
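The anchoring step is trivial to state in code (the function name is ours): reflectance is recovered only up to scale, so the brightest region is assumed white and the rest are expressed relative to it; if later constraints then push any region outside the limits, the anchoring assumption must be revised.

def relative_reflectances(region_values):
    """Scale recovered (relative) reflectances so that the brightest
    region is taken to be white, R = 1.0."""
    anchor = max(region_values)
    return [v / anchor for v in region_values]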

The parameters of illumination, flux densities I0 and I1, and unit vector S, can be determined by assuming reflectance and exploiting a variation of the tangency test. If we have an illuminated extremal boundary, Equation (5) gives a linear equation in the parameters for each point on the edge. The equations for any four points can be solved simultaneously to yield all the parameters. The remaining points on the boundary can be used with the now-known illumination to verify the assumption of an extremal boundary. An independent check on the ratio of I0 to I1 can be made at any shadow edge where surface orientation is known. This method of solving for illumination parameters can be extended to multiple point sources by merely increasing the number of points on the boundary used to provide equations. Care is required with multiple sources because it is necessary to know which points are illuminated by which sources. The method works for a centrally projected image, but not for an orthogonally projected one, since, in the latter case, all the known surface normals are coplanar.
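A sketch of the solution (hedged: Equation (5) is not reproduced here, so we assume the illuminated-surface model of our domain has the form L = R(I0 N·S + I1); the function name and the least-squares formulation are ours). With reflectance R assumed, each extremal-boundary point, where the normal N is known, contributes one linear equation in the unknowns (I0 Sx, I0 Sy, I0 Sz, I1); four or more points determine them:

import numpy as np

def illumination_from_boundary(normals, intensities, R=1.0):
    """normals     : K x 3 array of known unit normals on an illuminated
                     extremal boundary (K >= 4).
    intensities : the K observed image intensities at those points."""
    N = np.asarray(normals, float)
    L = np.asarray(intensities, float) / R
    A = np.hstack([N, np.ones((len(L), 1))])   # columns: I0*S and I1
    x, *_ = np.linalg.lstsq(A, L, rcond=None)
    I0_S, I1 = x[:3], x[3]
    I0 = np.linalg.norm(I0_S)                  # S is a unit vector
    S = I0_S / I0 if I0 > 0 else I0_S
    return I0, I1, S

Leftover boundary points can then be used, as described above, to verify the extremal-boundary assumption.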

Even modelling illumination as a set of point sources does not capture all the subtleties of real-world lighting, which include extended sources, possibly not distant, and secondary illumination. Extended sources cause shadows to have fuzzy edges, and close sources cause significant gradients in flux density and direction. Secondary illumination causes local perturbations of illumination that can even dominate primary illumination under some circumstances (e.g., light scattered into shadow regions). All these effects make exact modelling of illumination difficult, and hence cast suspicion on a recovery method that requires such models.

In the absence of precise illumination models that specify magnitude and direction distributions of flux density everywhere, accurate point-wise estimation of reflectance and surface orientation from photometric equations alone is not possible. It should still be possible, however, to exploit basic photometric constraints, such as local continuity of illumination, along with other imaging constraints and domain assumptions, to effect recovery within our general paradigm. As an example, we might still be able to find and classify edges in the intensity image: reflectance edges still have constant intensity ratios (less than 30:1) across them, shadow edges can be fuzzy with high intensity ratios, and occlusion and intersection edges are generally sharp, without constant ratios. The occlusion and intersection edges, together with reflectance texture gradient and continuity assumptions, should still provide a reasonable initial shape estimate. The resulting knowledge of surface continuity, the identified reflectance edges, and the assumption of reflectance constancy enable recovery of relative reflectance, and hence relative total incident flux density. The ability to determine continuity of illumination and to discriminate reflectance edges from other types thus allows us to generalize Horn's lightness determination process [18] to scenes with relief, occlusion, and shadows.
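These cues can be caricatured as a toy classifier (the thresholds and names are our own; the 30:1 figure is the reflectance-ratio limit cited above):

def classify_edge(ratio_profile, is_sharp, ratio_limit=30.0, const_tol=0.1):
    """ratio_profile : intensity ratios across the edge, sampled along it.
    is_sharp      : True if the transition is abrupt, False if fuzzy."""
    mean_ratio = sum(ratio_profile) / len(ratio_profile)
    spread = (max(ratio_profile) - min(ratio_profile)) / mean_ratio
    constant_ratio = spread < const_tol
    if constant_ratio and mean_ratio < ratio_limit:
        return "reflectance edge"
    if (not is_sharp) and mean_ratio >= ratio_limit:
        return "shadow edge"
    if is_sharp and not constant_ratio:
        return "occlusion or intersection edge"
    return "ambiguous"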

Having now made initial estimates of intrinsic characteristics, it may be possible to refine them using local illumination models and photometric knowledge. It may be possible, using assumptions of local monotonicity of illumination, to decide within regions whether the surface is planar, curved in some direction, or whether it inflects.

Even with all the extensions that we have so far discussed, our scene domain is still much less complex than the real world, in which one finds specularity, transparency, luster, visible light sources, three-dimensional texture, and sundry other complications. Although, at first sight, these phenomena make recovery seem much harder, they also represent additional intrinsic characteristics for more completely describing the scene, and potentially rich sources of information for forming the description. There are also many well-known sources of information about the scene that make use of multiple images, including stereo, motion parallax, and color. We believe that the framework we have put forward can be extended to accommodate these additional sources.

At this point, it is not clear whether adding complexity to our domain will lead to fewer ambiguities, or more. So far, however, we have seen no reason to rule out the feasibility of recovering intrinsic scene descriptions in real-world situations.

VII DISCUSSION

The concept of intrinsic images clarifies a number of issues in vision and generalizes and unifies many previously disjoint computational techniques. In this section, we will discuss some implications of our work, in the contexts of system organization, image analysis, scene analysis, and human vision.

A. System Organization

The proper organization of a visual system has been the subject of considerable debate. Issues raised include the controversy over whether processing should be data-driven or goal-driven, serial or parallel, the level at which the transition from iconic to symbolic representation should occur, and whether knowledge should be primarily domain-independent or domain-specific [4]. These issues have a natural resolution in the context of a system organized around the recovery of intrinsic images, as in Figure 10.

The recovery process we have outlined is primarily data-driven and dependent only on general domain assumptions. Subsequent goal-oriented or domain-specific processes (ranging from segmentation to object recognition) may then operate on information in the intrinsic images.

Intrinsic images appear to be a natural interface between iconic and symbolic representations. The recovery process seems inherently iconic, and suited to implementation as an array of parallel processes attempting to satisfy local constraints. The resulting information is iconic, invariant, and at an appropriate level for extracting symbolic scene descriptions. Conversely, symbolic information from higher levels (e.g., the size or color of known objects) can be converted to iconic form (e.g., estimates of distance or reflectivity) and used to refine or influence the recovery process. Symbolic information clearly plays an important role in the perception of the three-dimensional world.

Multilevel, parallel constraint satisfaction is an appealing way to organize a visual system because it facilitates incremental addition of knowledge and avoids many arbitrary problems of control and sequencing. Parallel implementations have been used previously, but only at a single level, for recovering lightness [18], depth [28], and shape [35]. Each of these processes is concerned with recovering one of our intrinsic images, under specialized assumptions equivalent to assuming values for the other images. In this paper, we have suggested how they might be coherently integrated.

Figure 10 Organization of a visual system

Issues of stability and speed of convergence always arise in iterative approaches to constraint satisfaction. Heuristic "relaxation" schemes (e.g., [32]), in which "probabilities" are adjusted, often use ad hoc updating rules for which convergence is difficult to obtain and prove. By contrast, the system we have described uses iterative relaxation methods to solve a set of equations and inequalities. The mathematical principles of iterative equation solving are well understood [2] and should apply to our system, at least for a fixed set of edges. Insofar as local edge modifications have only local consequences, operations such as gap filling should affect convergence to only a minor extent.

Speed of convergence is sometimes raised as an objection to the use of cooperative processes in practical visual systems; it is argued that such processes would be too slow in converging to explain the apparent ability of humans to interpret an image in a fraction of a second. This objection can be countered in several ways: first, full convergence may not be necessary to achieve acceptable accuracy; second, information propagation may be predominantly local, influenced primarily by nearby edges; third, there are ways of speeding up long-range propagation -- for example, using hierarchies of resolution [15].
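The third point can be illustrated with a coarse-to-fine sketch (our own minimal rendering, reusing a Jacobi-style relaxation like the orientation example earlier): relaxing first on a subsampled grid lets information cross the image in few iterations, and the upsampled result then initializes the finer level.

import numpy as np

def relax(values, fixed_mask, iterations):
    """Jacobi averaging of a scalar intrinsic image with fixed known values."""
    v = values.copy()
    for _ in range(iterations):
        avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
               np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        v = np.where(fixed_mask, v, avg)
    return v

def coarse_to_fine(values, fixed_mask, levels=3, iterations=50):
    """Hierarchy-of-resolution speed-up: solve coarse, upsample, refine."""
    if levels == 1 or min(values.shape) < 4:
        return relax(values, fixed_mask, iterations)
    coarse = coarse_to_fine(values[::2, ::2], fixed_mask[::2, ::2],
                            levels - 1, iterations)
    init = np.kron(coarse, np.ones((2, 2)))[:values.shape[0], :values.shape[1]]
    init = np.where(fixed_mask, values, init)   # re-impose the known values
    return relax(init, fixed_mask, iterations)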

B. Image Analysis

Image analysis is usually considered to be the generation of a two-dimensional description of the image, such as a line drawing, which may subsequently be interpreted. We believe it is important to take a more liberal view, and include some consideration of the three-dimensional meaning of image features in the earliest descriptions.

A topic often debated is segmentation -- the process of partitioning an image into semantically interpretable regions. Fischler [9] and others have raised a number of critical questions: is segmentation meaningful in a domain-independent sense, is it possible, and how should it be done?

Partitioning of an arbitrary intensity image into regions denoting objects is an illusory goal, unless the notion of what constitutes an object is precisely defined. Since objects are often collections of pieces whose association must be learned, general object segmentation seems impossible in principle. It seems more reasonable, on the other hand, to partition an image into regions corresponding to smooth surfaces of uniform reflectance. This is often the implicit goal of programs that attempt to partition an image into regions of uniform intensity. Unfortunately, intensity does not correspond directly to surface characteristics. There is no way of determining whether merging two regions is meaningful, and consequently there is no reliable criterion for terminating the merging process. Segmentation based on intrinsic images avoids these difficulties.

Another elusive goal, the extraction of a perfect line drawing from an intensity image, is also impossible in principle, for similar reasons: the physical significance of boundaries does not correlate well with the magnitude of intensity changes. Surface boundaries can be hard, and, in some places, impossible, to detect; shadows and texture contribute edge points in abundance, which, in this context, are noise. To attain a line drawing depicting surface boundaries, we must take into account the physical significance of intensity discontinuities. It is quite clear from the depth and orientation intrinsic images where edges are necessary for consistency and surface integrity. A perfect line drawing could be regarded as one of the products of the process of recovering intrinsic characteristics.

From this point of view, all attempts to develop more sophisticated techniques for extracting line drawings from intensity images appear inherently limited. Recently, relaxation enhancement techniques for refining the imperfect results of an edge detector have attracted considerable interest [32 and 15]. These techniques manipulate edge confidences according to the confidences of nearby points, iterating until equilibrium is achieved. This approach is really attempting to introduce and exploit the concept of edge continuity, and does lead to modest improvements. It does not, however, exploit the continuity of surfaces, nor ideas of edge type, and consequently produces curious results on occasion. Moreover, as we noted earlier, convergence for ad hoc updating rules is difficult to prove.

The major problem with all the image analysis techniques we have mentioned is that they are based on characteristics of the image without regard to how the image was formed. Horn at MIT has for some time urged the importance of understanding the physical basis of image intensity variations [20]. His techniques for determining surface lightness [18] and shape from shading [19] have had an obvious influence on our own thinking. To achieve a precise understanding of these phenomena, Horn considered each in isolation, in an appropriately simplified domain: a plane surface bearing regions of different reflectance lit by smoothly varying illumination for lightness, and a simple smoothly curved surface with uniform reflectance and illumination for shading. These domains are, however, incompatible, and the techniques are not directly applicable in domains where variations in reflectance, illumination, and shape may be discontinuous and confounded. We have attempted to make explicit the constraints and assumptions underlying such recovery techniques, so that they may be integrated and used in more general scenes.

The work most closely related to our own is that of Marr [29], who has described a layered organization for a general vision system. The first layer extracts a symbolic description of the edges and shading in an intensity image, known as the "Primal Sketch." These features are intended to be used by a variety of processes at the next layer (e.g., stereo correlation, texture gradient) to derive the three-dimensional surface structure of the image. The resulting level of description is analogous to our orientation and distance images, and is called the "2.5D sketch." Our general philosophy is similar to Marr's, but differs in emphasis, being somewhat complementary. We are all interested in understanding the organization of visual systems, in terms of levels of representation and information flow. Marr, however, has concentrated primarily on understanding the nature of individual cues, such as stereopsis and texture, while we have concentrated primarily on understanding the integration of multiple cues. We strongly believe that interaction of different kinds of constraints plays a vital role in unscrambling information about intrinsic scene characteristics.

A particular point of departure is Marr's reliance on two-dimensional image description and grouping techniques to perfect the primal sketch before undertaking any higher-level processing. By contrast, we attempt to immediately assign three-dimensional interpretations to intensity edges to initialize processing at the level of intrinsic images, and we maintain the relationship between intensities and interpretations as tightly as possible. In our view, perfecting the intrinsic images should be the objective of early visual processing; edges at the level of the primal sketch are the consequence of achieving a consistent three-dimensional interpretation. We shall discuss consequences of these differing organizations with reference to human vision shortly.


C. Scene Analysis

Scene analysis is concerned with interpreting images in three dimensions, in terms of surfaces, volumes, objects, and their interrelationships. The earliest work, on polyhedral scenes, involved extracting line drawings and then interpreting them using geometric object prototypes [31, 8]. A complementary approach analyzed scenes of simple curved objects by partitioning the image into regions of homogeneous intensity and then interpreting them using relational models of object appearances [5, 3]. Both these early approaches were limited by the unreliability of extracting image descriptions, as discussed in the preceding section, and by the lack of generality of the object prototypes used. It was soon discovered that to extract image descriptions reliably required exploiting knowledge of the scene and image formation process. Accordingly, attempts were made to integrate segmentation and interpretation. The general approach was to assign sets of alternative interpretations to regions of uniform intensity (or color), and then alternately merge regions with compatible interpretations and refine the interpretation sets. The process terminates with a small number of regions with disjoint (and hopefully unique) interpretations. Yakimovsky and Feldman [36] used Bayesian statistics for assigning interpretations and guiding a search for the set of regions and interpretations with the highest joint likelihood. Tenenbaum and Barrow (IGS [33]) used an inference procedure, similar to Waltz's filtering [34], for eliminating inconsistent interpretations. These systems performed creditably upon complex images and have been applied in a variety of scene domains. They are not, however, suitable as models of general-purpose vision because they are applicable only when all objects are known and can be distinguished on the basis of local evidence or region attributes. Unknown objects not only cannot be recognized; they cannot even be described.

What seems needed in a general-purpose vision system are more concise and more general descriptions of the scene, at a lower level than objects [4 and 38]. For example, once the scene has been described at the level of intrinsic surface characteristics, surfaces and volumes can be found and objects can then readily be identified. There is still the need to guide formation of lower-level descriptions by the context of higher-level ones, but now the gaps between levels are much smaller and easier to bridge.

Huffman [21], Clowes [6], and Waltz [34] demonstrated the possibility of interpreting line drawings of polyhedral scenes without the need for specific object prototypes. Their level of description involved a small set of linear scene features (convex edge, concave edge, shadow edge, crack) and a set of corner features, where such edges meet. These scene features were studied systematically to derive a catalog of corresponding image features and their alternative interpretations. Interpretation of the line drawing involved a combinatorial search for a compatible set of junction labels. Waltz, in particular, identified eleven types of linear feature, and three cases of illumination for the surfaces on each side. The resulting catalog of junction interpretations contained many thousands of entries. To avoid combinatorial explosion in determining the correct interpretations, Waltz used a pseudo-parallel local filtering paradigm that eliminated junction interpretations incompatible with any possible interpretation of a neighboring junction.

While we also create and use a catalog, the whole emphasis of our work is different. We have attempted to catalog the appearances of edges in gray-scale images, for a much wider class of objects, and have described them in a way that results in a much more parsimonious catalog. Instead of interpreting an ideal line drawing, we, in a sense, are attempting simultaneously to derive the drawing and to interpret it, using the interpretation to guide the derivation. In contrast to the junctions in line drawings, many gray-scale image features can be uniquely interpreted using intensity information and physical constraints. We are thus able to avoid combinatorial search and "solve" directly for consistent interpretations of remaining features. Our solution has a definite iconic flavor, whereas Waltz's has largely a symbolic one.

Mackworth's approach to interpreting line drawings [24] is somewhat closer to our point of view. He attempts to interpret edges, rather than junctions, with only two basic interpretations (Connect and Occluding); he models surfaces by representing their plane orientations explicitly; and he tries to solve constraints to find orientations that are consistent with the edge interpretations. The use of explicit surface orientation enables Mackworth to reject certain interpretations with impossible geometry, which are accepted by Waltz. Since he does not, however, make explicit use of distance information, there are still some geometrically impossible interpretations that Mackworth will accept. Moreover, since he does not use photometry, his solutions are necessarily ambiguous: Horn has demonstrated [20] that when photometric information is combined with geometry, the orientations of surfaces forming a trihedral corner may be uniquely determined. One fundamental point to be noted is that intrinsic characteristics provide a concise description of the scene that enables rejection of physically impossible interpretations.

Intermediate levels of representation have played an increasingly important role in recent scene analysis research. Agin and Binford [1] proposed a specific representation of three-dimensional surfaces, as generalized cylinders, and described a system for extracting such representations using a laser rangefinder. Marr and Nishihara [29] have described techniques for inferring a generalized cylinder representation from line drawings, and for matching geometrically to object prototypes in recognition. Cylindrical volume representations are being used in the VISIONS system, under development by Riseman et al. [16]. This system also includes explicit representation of surfaces, which are inferred from a two-dimensional image description using higher-level cues, such as linear perspective and the shape of known objects. Symbolic representations at the level of surfaces and volumes should be easier to derive from intrinsic images than from intensity images, line drawings, or even from noisy rangefinder data.


D. Human Vision

In this paper, we have been concerned with the computational nature of vision tasks, largely independent of implementation in biological or artificial systems. This orientation addresses questions of competence: what information is needed to accomplish a task, is it recoverable from the sensed image, and what additional constraints, in the form of assumptions about the world, are needed to effect the recovery?

Psychologists have been asking similar questions from their own viewpoint for many years. For example, there has been a long-standing debate, dating back at least to Helmholtz, concerning how, and under what circumstances, it is possible to independently estimate illumination and reflectance. Recent participants include Land, with his Retinex theory of color perception [23], and Gilchrist, who has identified a number of ways in which intensity edges may be classified (e.g., the 30:1 ratio, intersecting reflectance and illumination edges) [12].

From such work, a number of theories have been proposed to explain human abilities to estimate various scene characteristics. No one, however, has yet proposed a comprehensive model integrating all known abilities. While we have no intention of putting forward our model as a complete explanation of human vision, we think that the recovery of intrinsic characteristics is a plausible role for early stages of visual processing in humans.

Figure 11 A subjective contour

This hypothesis would appear to explain, at least at a superficial level, many well-known psychological phenomena. The constancy phenomena are the obvious examples, but there are others. Consider, for example, the phenomenon of subjective contours, such as appear in Figure 11. Marr suggests [26] that such contours result from grouping place tokens corresponding to line endings in the primal sketch, and further suggests a "least-commitment" clustering algorithm to control the combinatorics of grouping. We suggest, as an alternative explanation, that the abrupt line endings are locally interpreted three-dimensionally as evidence of occlusion, causing discontinuity edges to be established in the distance image. The subjective contours then result from making the distance image consistent with these boundary conditions and assumptions of continuity of surfaces and surface boundaries: they are primarily contours of distance, rather than intensity. The net result is the interpretation of the image as a disk occluding a set of radiating lines on a more distant surface.

Figure 12 Subjective depth (Coren [7])

There is considerable evidence to support the hypothesis that subjective contours are closely correlated with judgements of apparent depth. Coren [7] reports a very interesting demonstration. In Figure 12, the two circles subtend the same visual angle; however, the apparent elevation of the subjective triangle causes the circle within it to be perceived as smaller, consistent with the hypothesis that it is nearer. Surfaces perceived when viewing Julesz stereograms [22] have edges that are purely subjective. There are no distinguishing cues in the originating intensity images: the edges result solely from the discontinuity in disparity observed in a stereo presentation. Hochberg [17] has investigated the subjective contours produced by shadow cues to depth, seen in figures like Figure 13. Most observers report that Figure 13a is perceived as a single entity in relief, while Figure 13b is not. The figures are essentially equivalent as two-dimensional configurations of lines. The difference between the figures is that in 13b, the lines are not consistent with the shadows of a single solid entity cast by a directional light source.

We can generalize our argument that subjective contours arise as a consequence of three-dimensional organization to other phenomena of perceptual organization, such as the Gestalt laws. For example, the law of continuity follows directly from assumptions of continuity of surfaces and boundaries, and the law of closure follows from integrity of surfaces.

Figure 13 Subjective figures (Hochberg [17])

A system that is attempting to form a consistent interpretation of an image as a three-dimensional scene will, in principle, do better than one that is attempting to describe it as a two-dimensional pattern. The organization of a chaotic collection of image features may become clear when they are considered as projections of three-dimensional features, and the corresponding constraints are brought into play.

Gregory has emphasized the importance of three-dimensional interpretations and has suggested that many illusions result from attempting to form inappropriate three-dimensional interpretations of two-dimensional patterns [13 and 14]. He also suggests that certain other illusions, such as those involving the Ames room, result from applying incorrect assumptions about the nature of the scene. The distress associated with viewing ambiguous figures, such as the well-known "impossible triangle" and "devil's pitchfork," arises because of the impossibility of making local evidence consistent with assumptions of surface continuity and integrity.

Recent experiments by Gilchrist [11] demonstrate that judgements of the primary intrinsic characteristics are tightly integrated in the human visual system. In one experiment, the apparent position of a surface is manipulated using interposition cues, so that it is perceived as being either in an area of strong illumination or in one of dim illumination. Since the absolute reflected flux from the surface remains constant, the observer perceives a dramatic change in apparent reflectance, virtually from black to white. In a second experiment, the apparent orientation of a surface is changed by means of perspective cues. The apparent reflectance again changes dramatically, depending upon whether the surface is perceived as facing towards or away from the light source.

We do not present our model of the recovery of intrinsic scene characteristics as a comprehensive explanation of these psychological phenomena. We feel, however, that it may provide a useful viewpoint for considering at least some of them.

VIII CONCLUSION

The key ideas we have attempted to convey in this paper are:

* A robust visual system should be organized around a noncognitive, nonpurposive level of processing that attempts to recover an intrinsic description of the scene.

* The output of this process is a set of registered "intrinsic images" that give values for such characteristics as reflectance, orientation, distance, and incident illumination, for every visible point in the scene.

* The information provided by intrinsic images greatly facilitates many higher-level perceptual operations, ranging from segmentation to object recognition, that have so far proved difficult to implement reliably.

* The recovery of intrinsic scene characteristics is a plausible role for the early stages of human visual processing.

* Investigation of low-level processing should focus on what type of information is sought, and how it might be obtained from the image. For example, the design of edge detectors should be based on the physical meaning of the type of edge sought, rather than on some abstract model of an intensity discontinuity.

We have outlined a possible model of the recovery process, demonstrated its feasibility for a simple domain, and argued that it can be extended in a straightforward way towards real-world scenes. Key ideas in the recovery process are:

* Information about the intrinsic characteristics is confounded in the intensities of the input image. Therefore, recovery depends on exploiting assumptions and constraints from the physical nature of imaging and the world.

* Interactions and ambiguities prohibit independent recovery of the intrinsic characteristics; the recovery process must determine their values simultaneously in achieving consistency with the constraints.

* Interpretation of boundaries plays a key role in recovery; boundaries provide information about which characteristics are continuous and which discontinuous at each point in the image, and they provide explicit boundary conditions for the solution.

* The nature of the solution, involving a large number of interacting local constraints, suggests implementation of the recovery process by an array of local parallel processes that achieve consistency by a sequence of successive approximations.

Our model for recovering intrinsic characteristics is at a formative stage. Important details still to be finalized include the appropriate intrinsic images, constraints, constraint representations, and the paths of information flow relating them. Nevertheless, the ideas we have put forth in this paper have already clarified many issues for us and suggested many exciting new prospects. They also raise many questions to be answered by future research, the most important of which are "Can it work in the real world?" and "Do people see that way?" To the extent that our model corresponds to the human visual system, valuable insights may be gained through collaboration between researchers in computer vision and human vision.

IX ACKNOWLEDGEMENTS

This work was supported by a grant from the National Science Foundation.

Edward Riseman's persistent exhortations provided the necessary motivation to commit our ideas to paper, and his valuable editorial comments improved their presentation.

Bill Park's artistic hand penned the drawings of intrinsic images.

REFERENCES

1. Agin, G. J., and Binford, T. O. Computer description of curved objects. Proc. 3rd International Joint Conference on Artificial Intelligence, Stanford University, Stanford, 1973.

2. Allen, D. N. de G. Relaxation Methods in Engineering and Science. McGraw-Hill, New York, 1954.

3. Barrow, H. G., and Popplestone, R. J. Relational descriptions in picture processing. In Machine Intelligence, Vol. 6, B. Meltzer and D. Michie (Eds.), Edinburgh University Press, Edinburgh, Scotland, 1971, pp. 377-396.

4. Barrow, H. G., and Tenenbaum, J. M. Representation and use of knowledge in vision. Tech. Note 108, Artificial Intelligence Center, SRI International, Menlo Park, CA, 1975.

5. Brice, C., and Fennema, C. Scene analysis using regions. Artificial Intelligence, 1, No. 3, 1970, 205-226.

6. Clowes, M. B. On seeing things. Artificial Intelligence, 2, No. 1, 1971, 79-112.

7. Coren, S. Subjective contour and apparent depth. Psychological Review, 79, 1972, 359.

8. Falk, G. Interpretation of imperfect line data as a three-dimensional scene. Artificial Intelligence, 4, No. 2, 1972, 101-144.

9. Fischler, M. A. On the representation of natural scenes. In Computer Vision Systems, A. Hanson and E. Riseman (Eds.), Academic Press, New York, 1978.

10. Garvey, T. D. Perceptual strategies for purposive vision. Tech. Note 117, Artificial Intelligence Center, SRI International, Menlo Park, CA, 1976.

11. Gilchrist, A. L. Perceived lightness depends on perceived spatial arrangement. Science, 195, 1977, 185-187.

12. Gilchrist, A. L. Private communication.

13. Gregory, R. L. Eye and Brain. Weidenfeld and Nicolson, London, 1966.

14. Gregory, R. L. The Intelligent Eye. McGraw-Hill, New York, 1970.

15. Hanson, A., and Riseman, E. Segmentation of natural scenes. In Computer Vision Systems, A. Hanson and E. Riseman (Eds.), Academic Press, New York, 1978.

16. Hanson, A., and Riseman, E. VISIONS: A computer system for interpreting scenes. In Computer Vision Systems, A. Hanson and E. Riseman (Eds.), Academic Press, New York, 1978.

17. Hochberg, J. In the mind's eye. In Contemporary Theory and Research in Visual Perception, R. N. Haber (Ed.), Holt, Rinehart and Winston, New York, 1968.

18. Horn, B. K. P. Determining lightness from an image. Computer Graphics and Image Processing, 3, 1974, 277-299.

19. Horn, B. K. P. Obtaining shape from shading information. In The Psychology of Computer Vision, P. H. Winston (Ed.), McGraw-Hill, New York, 1975.

20. Horn, B. K. P. Understanding image intensities. Artificial Intelligence, 8, No. 2, 1977, 201-231.

21. Huffman, D. A. Impossible objects as nonsense sentences. In Machine Intelligence, Vol. 6, B. Meltzer and D. Michie (Eds.), Edinburgh University Press, Edinburgh, 1971, pp. 295-323.


22. Julesz, B. Foundations of Cyclopean Perception. University of Chicago Press, Chicago, 1971.

23. Land, E. H. The Retinex theory of color vision. Scientific American, 237, No. 6, Dec. 1977, 108-128.

24. Mackworth, A. K. Interpreting pictures of polyhedral scenes. Artificial Intelligence, 4, 1973, 121-138.

25. Macleod, I. D. G. A study in automatic photo-interpretation. Ph.D. thesis, Department of Engineering Physics, Australian National University, Canberra, 1970.

26. Marr, D. Early processing of visual information. AI Memo 340, Artificial Intelligence Lab., MIT, Cambridge, Mass., 1976.

27. Marr, D. Analysis of occluding contour. Proc. Roy. Soc. Lond. B, 197, 1977, 441-475.

28. Marr, D., and Poggio, T. Cooperative computation of stereo disparity. Science, 194, 1976, 283-287.

29. Marr, D. Representing visual information. In Computer Vision Systems, A. Hanson and E. Riseman (Eds.), Academic Press, New York, 1978.

30. Nitzan, D., Brain, A. E., and Duda, R. O. The measurement and use of registered reflectance and range data in scene analysis. Proc. IEEE, 65, No. 2, 1977, 206-220.

31. Roberts, L. G. Machine perception of three-dimensional solids. In Optical and Electro-optical Information Processing, J. T. Tippett et al. (Eds.), MIT Press, Cambridge, Mass., 1965.

32. Rosenfeld, A., Hummel, R. A., and Zucker, S. W. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, SMC-5, 1976, 420-433.

33. Tenenbaum, J. M., and Barrow, H. G. Experiments in interpretation-guided segmentation. Artificial Intelligence, 8, No. 3, 1977, 241-274.

34. Waltz, D. L. Generating semantic descriptions from drawings of scenes with shadows. Tech. Rept. AI-TR-271, Artificial Intelligence Lab., MIT, Cambridge, Mass., Nov. 1972.

35. Woodham, R. J. A cooperative algorithm for determining surface orientation from a single view. Proc. 5th International Joint Conference on Artificial Intelligence, MIT, Cambridge, Mass., 1977.

36. Yakimovsky, Y. Y., and Feldman, J. A semantics-based decision theoretic region analyzer. Proc. 3rd International Joint Conference on Artificial Intelligence, Stanford Univ., Stanford, 1973.

37. Yonas, A. Private communication, Institute of Child Development, University of Minnesota, 1977.

38. Zucker, S., Rosenfeld, A., and Davis, L. General purpose models: Expectations about the unexpected. Proc. 4th International Joint Conference on Artificial Intelligence, Tbilisi, Georgia, USSR, 1975.
