
A Head Called Richard

P. Mowforth*, P. Siebert†, Z. P. Jin* & C. Urquhart‡

*The Turing Institute, George House, 36 North Hanover Street, Glasgow, Scotland, G1 2AD; †BBN Systems and Technologies, Edinburgh, UK, EH14 4AP;

‡BBN and Dept. of EEE, Heriot-Watt University, Edinburgh, UK, EH1 2HT

BMVC 1990 doi:10.5244/C.4.64

This paper describes two preliminary experiments concerned with the construction of a robot head. The initial design and research is aimed at producing a system with two cameras and two microphones mounted on a platform capable of operating with the same degrees of freedom and reflex times as its biological counterpart. Whilst the primary goal of the project is to develop an anthropomorphic system with the sensory reflex capabilities of a human head, the system will also contain some non-anthropomorphic components. The most obvious of the non-anthropomorphic components is a spatially and temporally programmable light source. Some preliminary results are presented.


This paper outlines several of the ideas behind a project to build a robot head and provides some preliminary results for one of the reflex systems, as well as giving some details of its integrated texture projection system. This novel projection system allows the head to project illumination onto a scene with computer control in both temporal and spatial (textural) domains.

Whilst focus and aperture have been readily automated in devices such as the camcorder, the automation of vergence, tracking and the basic orientation reflexes is not well advanced. Such built-in reflexes are critical components of any sophisticated vision system. Without a vergence or slow tracking system, stereo and motion solutions are limited to a narrow operational range involving the limits of data fusion. Gaze control allows the system to direct attention, for example, to get a particular view which helps reconcile some ambiguity in an otherwise static system. Gaze may also be used to provide closed-loop tracking, which may help simplify the computation of intrinsic image information.

There have been few attempts to build vision systems that include cameras which can be motor driven on some of their degrees of freedom. Probably the two best examples are those at Rochester [7] and at Munich [6]. Each of these two systems is strictly limited in terms of its functionality. Early versions of the Rochester system used a vergence control system driving the cameras in opposite directions so as to maintain coincidence between the two images, whilst the Munich system allows XY gaze control of a single camera directed at a target feature.

The integration of other sensory modalities into orientation reflexes for directing, for example, a vision system has also received little attention. Systems have been built which attempt to localise sound sources and use this information to direct autonomous behaviour (Brooks, 1989), but these reflexes have not been coupled with vision systems for integrated tracking or gaze control.

VERGENCE CONTROL

Vergence control is necessary to bring images captured from two locations in space into approximate correspondence. This may be used directly to provide information about range as well as providing a stereo process with favourable starting conditions. Stereo matching algorithms contain fusional limits for the maximum allowable disparity; the vergence mechanism simply attempts to maximise the amount of the image pair which can then be fused. Rather than provide a globally best vergence signal from a stereo pair, vergence mechanisms typically attempt to find correspondence between small subsets of the images, i.e. central foveas [3].

Whilst one important requirement for the project is real-time execution of the vergence mechanism, the requirement for precision is not as high as for the stereo process.

Based on the above understanding, we have developed a vergence control algorithm which fulfills the above-mentioned requirements.


Page 2: A Head Called Richardsive mode using unstructured lighting, e.g. outdoors or when subject to extraneous illumination fields, and hence will be inherently robust in operation. In short,


Figure 1: Software architecture of the vergence control algorithm.

Figure 2: A natural corridor scene.



Algorithm

The algorithm design is strongly constrained by the requirement for fast execution time. To achieve this, a pyramid image architecture has been adopted. Figure 1 provides a summary of the software architecture for the vergence control algorithm.

First a pyramid is built from the input stereogram. Then, the image pair is blurred by a small-sigma Gaussian function at each pyramid level. This filtering is necessary in order to achieve left-right matches at each level. Next, a small fixed-size window is applied to the left and right images, again at each level. The window is able to move to the left or to the right. Correlation is then carried out between the image pair masked by these windows as a search operation. The search starts from the image pair at the top of the pyramid using a zero offset of the windows. By moving the windows either to the left or to the right, the best match is found, which then serves as the starting offset of the windows at the next lower level of the pyramid, where another round of search begins. This process continues until the best match has been found at the bottom of the pyramid. The window offset which leads to this best match is used for vergence control.

The progressive focusing of the solution through a series of octave-separated levels is effectively a coarse-to-fine fovea. The algorithm is based on the Multiple Scale Signal Matching algorithm developed for matching in stereo and motion computation [5]. Like MSSM, the algorithm also produces a confidence value accompanying its estimate of vergence.
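To make the search concrete, the following C sketch (our illustration, not the authors' implementation) follows the coarse-to-fine scheme just described. It simplifies freely: the pyramid is built by 2x2 block averaging rather than Gaussian filtering, a sum of squared differences stands in for the correlation measure, and no confidence value is computed.

```c
#include <stdlib.h>
#include <string.h>

#define LEVELS 4    /* pyramid depth for a 128 x 128 image */
#define WIN    16   /* fixed fovea window width (pixels)   */

/* Halve an image by 2x2 averaging (size -> size/2). */
static void downsample(const unsigned char *src, unsigned char *dst, int size)
{
    int half = size / 2;
    for (int y = 0; y < half; y++)
        for (int x = 0; x < half; x++)
            dst[y * half + x] =
                (src[2*y*size + 2*x]     + src[2*y*size + 2*x + 1] +
                 src[(2*y+1)*size + 2*x] + src[(2*y+1)*size + 2*x + 1]) / 4;
}

/* SSD between central windows, right window shifted by "off" pixels. */
static long window_ssd(const unsigned char *l, const unsigned char *r,
                       int size, int off)
{
    int x0 = size / 2 - WIN / 2, y0 = size / 2 - WIN / 2;
    long ssd = 0;
    for (int y = y0; y < y0 + WIN; y++)
        for (int x = x0; x < x0 + WIN; x++) {
            int xr = x + off;
            if (xr < 0) xr = 0;
            if (xr > size - 1) xr = size - 1;
            int d = (int)l[y * size + x] - (int)r[y * size + xr];
            ssd += (long)d * d;
        }
    return ssd;
}

/* Coarse-to-fine search: the winning offset at each level seeds,
 * doubled, the search at the next finer level.  The offset found at
 * the bottom of the pyramid is the vergence estimate in pixels. */
int vergence_estimate(const unsigned char *left, const unsigned char *right,
                      int size /* e.g. 128 */)
{
    unsigned char *pl[LEVELS], *pr[LEVELS];
    int sz[LEVELS];

    sz[0] = size;
    pl[0] = malloc(size * size); memcpy(pl[0], left,  size * size);
    pr[0] = malloc(size * size); memcpy(pr[0], right, size * size);
    for (int i = 1; i < LEVELS; i++) {
        sz[i] = sz[i - 1] / 2;
        pl[i] = malloc(sz[i] * sz[i]);
        pr[i] = malloc(sz[i] * sz[i]);
        downsample(pl[i - 1], pl[i], sz[i - 1]);
        downsample(pr[i - 1], pr[i], sz[i - 1]);
    }

    int off = 0;                              /* zero offset at the top */
    for (int i = LEVELS - 1; i >= 0; i--) {
        int best = off;
        long best_ssd = window_ssd(pl[i], pr[i], sz[i], off);
        for (int dir = -1; dir <= 1; dir += 2) {   /* climb left, right */
            int o = off + dir;
            long s;
            while ((s = window_ssd(pl[i], pr[i], sz[i], o)) < best_ssd) {
                best_ssd = s; best = o; o += dir;
            }
        }
        off = (i > 0) ? best * 2 : best;      /* seed next finer level */
    }

    for (int i = 0; i < LEVELS; i++) { free(pl[i]); free(pr[i]); }
    return off;
}
```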

Performance

The algorithm is programmed in C and runs on a Sun SPARCstation 1. For a 128 x 128, 8 bits per pixel image pair, the typical times consumed by the different parts of the program are 20-30 ms to read in the image pair, 20-30 ms to build the pyramid, 30-40 ms for matching and searching, and 10-20 ms for the other miscellaneous operations. The total time taken to produce a vergence estimate is 80-120 ms. The algorithm is of pixel accuracy.

The algorithm has been tried out on a natural corridor scene and a random-dot image. From each of these two images, two partially overlapping sub-images were extracted with known amounts of offset. A third test was performed in which various amounts of Gaussian noise were added to the corridor stereogram.

Figure 2 is a natural corridor scene which served as a base image from which stereograms of different disparity were derived.

Figure 3 is a stereogram derived from the natural corridor image.

Figure 4 is the same stereogram as above but with added Gaussian random noise.

Figure 5a is a random-dot image; 5b is a stereogram derived from it.

Figure 6 shows that with a natural image pair, the algorithm functions properly in the symmetric interval from -51 to 51 pixels. With added noise, the confidence value fell according to the amount of noise present. The estimate of error was 2 pixels for a disparity of 51 pixels and a noise sigma of 50.



Figure 3: A stereogram derived from the natural corridor scene image.

Figure 4: A noise-corrupted natural corridor scene stereogram.

Figure 6: Algorithm performance - see text for details. (Two plots of confidence (%) against disparity in pixels, from -128 to 128: one for the corridor stereogram, with noise levels σ = 10 and σ = 50, and one for the random-dot stereogram.)


Figure 5: a, random-dot image; b, random-dot stereogram derived from a.

INTEGRATED DYNAMIC PROJECTION

All vision systems that are intended to sense objects that do not intrinsically radiate their own light energy require some form of illumination source by which to sense such objects. Indeed, the majority of vision installations utilise some form of artificial illumination when operating indoors. Under such circumstances we are presented with the opportunity of utilising a controlled illumination source that may be structured to enhance the operation of any vision algorithms executed by the robot head.



Accordingly, a novel feature of the robot head design is the incorporation of a programmable projection-based illumination source. This source will illuminate the scene using a steerable projector that tracks the motion of the stereo cameras to ensure that the projector field of illumination remains congruent with the fields of view of the stereo cameras¹.

Integration of Active and Passive Sensing

Active vision techniques typically employ some form of structured illumination (e.g. light striping, moiré fringing etc. [2]) and rely upon interpreting contrast differences provided by this illumination to compute surface range estimates. By definition, active techniques employ interpretation algorithms that are intimately coupled to the structure of the adopted lighting source and are accordingly ad hoc in nature, i.e. incapable of operating in a purely passive mode and thereby utilising unstructured lighting.

We intend to effect the integration of active and passive sensing techniques by bringing controlled illumination to bear on specific interpretation problems that give rise to interpretation ambiguity, e.g. when the illumination configuration details are unknown and shading analysis is attempted. Previous attempts to combine active and passive sensing modalities have been reported [10]; however, these simply combine the results of passive interpretation with structured-illumination-based ranging. Our approach has been to utilise specific modes of illumination suited to particular passive interpretation tasks. When shading analysis is being performed, a simple illumination field of known characteristic can be provided. When stereo interpretation is being performed, the projector can be switched to illuminate the scene with a Gaussian texture field. We have previously reported [8] that the performance of scale-space stereo algorithms can be dramatically improved by the projection of textured light onto the scene. Hence a projection mode would be selected according to the interpretation task at hand. Indeed, interpretation algorithms that perform the tasks of stereo and shape-from-shading exhibit complementary performance characteristics depending upon the density or nature of intrinsic texture patterns on the surfaces of objects contained in the scene.
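As an illustration of this task-driven mode selection (the types and names below are hypothetical, not part of the head's actual control software):

```c
/* Hypothetical projector control: the interpretation task at hand
 * selects the illumination mode, per the scheme described above. */
typedef enum {
    PROJ_UNIFORM,   /* known, simple field for shading analysis  */
    PROJ_TEXTURE,   /* Gaussian random texture field for stereo  */
    PROJ_OFF        /* purely passive operation, e.g. outdoors   */
} proj_mode;

typedef enum { TASK_SHADING, TASK_STEREO, TASK_PASSIVE } task;

static proj_mode select_projection(task t)
{
    switch (t) {
    case TASK_SHADING: return PROJ_UNIFORM;
    case TASK_STEREO:  return PROJ_TEXTURE;
    default:           return PROJ_OFF;
    }
}
```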

Several advantages can be gained by adopting this approach to integrating active and passive vision techniques. Controlled, programmable illumination makes possible the investigation, development and utilisation of passive interpretation algorithms under ideal sensing conditions. From the perspective of vision research this is highly expedient, as individual sensing mechanisms become increasingly better understood in isolation while their combination remains difficult. From a systems perspective, the robot head will be capable of operating, albeit with degraded performance, in a passive mode using unstructured lighting, e.g. outdoors or when subject to extraneous illumination fields, and hence will be inherently robust in operation.

¹A broad analogy might be the lamp on a miner's helmet!


In short, the system will be based on passive sensing techniques but can also be enhanced in operation by the use of controlled active illumination. This paradigm complements the major project goal of utilising a dynamic approach to visual interpretation, in the sense of employing sensor platform mobility or motion to resolve interpretation ambiguity [1].

Enhancement of Stereo Using Textured Light

As stated above, we have undertaken a number of experiments that investigate enhancing a passive scale-space stereo algorithm, MSSM [5], by illuminating the imaged scene with a Gaussian noise field, i.e. textured light.

There are four main problems which can lead to match failures in the stereo matching process [4]:

• Photometric variations at a point viewed from two angles.

• Lack of texture in a region.

• Presence of repetitive texture patterns.

• Occlusion between the camera views of some points in the stereo pair.

The problem of photometric variations is overcome in the MSSM algorithm by utilising a covariance measure to search for disparity estimates that does not rely upon correlating absolute grey-level intensities. Random texture projection both addresses the problem of match failure where the intrinsic texture is sparse and counteracts the matching ambiguity caused by repetitive texture patterns.
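The paper does not spell out MSSM's covariance measure; the sketch below shows one standard form such a measure can take, a zero-mean normalised cross-correlation, which is insensitive to the gain and offset differences that defeat raw grey-level correlation.

```c
#include <math.h>

/* Zero-mean normalised cross-correlation between two n-sample
 * windows a and b.  Subtracting each window's mean and dividing by
 * the standard deviations makes the score invariant to additive and
 * multiplicative photometric differences between the two views.
 * Returns a value in [-1, 1]. */
double zncc(const unsigned char *a, const unsigned char *b, int n)
{
    double ma = 0.0, mb = 0.0;
    for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;

    double cov = 0.0, va = 0.0, vb = 0.0;
    for (int i = 0; i < n; i++) {
        double da = a[i] - ma, db = b[i] - mb;
        cov += da * db; va += da * da; vb += db * db;
    }
    if (va == 0.0 || vb == 0.0) return 0.0;   /* textureless window */
    return cov / sqrt(va * vb);
}
```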

Projected random texture gives rise to an extended spectral distribution that provides matching energy at all scales in scale space. The low-frequency limit of the texture spectral distribution sets the maximum scale size, and hence disparity, that can be matched unambiguously. The high-frequency limit of the spectral distribution sets the smallest scale size and hence the final degree of scale-space match refinement. Ideally the texture spectrum should be constant or "white" to achieve the optimum autocorrelation characteristics. However, in practice random texture projection can provide "random dot stereograms" of the imaged scene and thereby counter both lack of texture and the effects of repetitive intrinsic surface markings.
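As a rough numerical illustration of such a texture field (our sketch; the texel size and aspect ratio actually tuned in [9] are not reproduced here), the following C fragment draws one Gaussian sample per texel and replicates it, so that the texel dimensions set the upper limit of the spectral distribution:

```c
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

/* One zero-mean, unit-variance Gaussian sample via Box-Muller. */
static double gaussian(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* avoid log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

/* Fill a w x h 8-bit image with a Gaussian random texture field.
 * One sample is drawn per texel and replicated over tx x ty pixels,
 * so tx and ty set the texel size and aspect ratio.  Grey levels are
 * centred on mid-grey; the scaling constant is illustrative. */
void make_texture(unsigned char *img, int w, int h, int tx, int ty)
{
    for (int y = 0; y < h; y += ty)
        for (int x = 0; x < w; x += tx) {
            double g = 128.0 + 48.0 * gaussian();
            unsigned char v = g < 0.0 ? 0 : g > 255.0 ? 255 : (unsigned char)g;
            for (int j = y; j < y + ty && j < h; j++)
                for (int i = x; i < x + tx && i < w; i++)
                    img[j * w + i] = v;
        }
}
```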



Figure 7: Left image captured using random texture projection. Right image captured using white light as normal.

Figure 8: 2D DFT power spectra of the images from figure 7 respectively.


We have investigated in detail [9] the optimum spectral characteristics of the projected textured field (for our current camera/projection apparatus) in terms of the texture size and aspect ratio that yields the greatest horizontally biased spectral energy envelope in the sensed image. Figure 7 shows the left-hand images of stereo pairs with and without texture projection respectively. In this example the texture was provided by a simple overhead projector containing a laser-printed texture transparency. Figure 8 shows the increased spectral energy due to texture projection, comparing the DFT power spectra of these images. Figure 9 shows wire-frame reconstructions of the range image surface obtained by matching the texture projected stereo pair using MSSM.


Figure 9: Wire-frame reconstructions of range information recovered from the texture projected stereo pair (left image of figure 7).



Briefly, the final cause of match failure listed, occlusions, can be detected as low-confidence matches when utilising active texture projection. We have investigated [9] detecting occlusions using both this effect and the shape of the correlation search graph itself.

Discussion

This paper has demonstrated two applications of Multiple Scale Signal Matching as part of a single project. The work has helped outline a framework for the development of an anthropomorphic head capable of a number of integrated reflex functions. Because it is recognised that there will be occasions where the head is unable to provide good measures of its environment, the system is enhanced with the capability for active illumination, designed to assist its otherwise passive systems.

It is intended to extend the use of the signal matching technology to provide fast matching between microphone signals from the robot head. The purpose here will be to provide orientation information, so enabling sounds to provide a direction reflex for vision.
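Although this auditory reflex is future work, the underlying idea can be sketched: estimate the inter-microphone delay by cross-correlation and convert it to a bearing. All parameters below (sample rate, microphone separation, far-field geometry) are assumptions for illustration, not measurements from the head.

```c
#include <math.h>

#define FS        44100.0  /* sample rate, Hz (assumed)          */
#define MIC_SEP   0.20     /* microphone separation, m (assumed) */
#define C_SOUND   343.0    /* speed of sound, m/s                */

/* Delay (in samples) of b relative to a, found by maximising the
 * cross-correlation over +/- max_lag samples. */
static int best_lag(const float *a, const float *b, int n, int max_lag)
{
    int best = 0;
    double best_corr = -1e30;
    for (int lag = -max_lag; lag <= max_lag; lag++) {
        double corr = 0.0;
        for (int i = 0; i < n; i++) {
            int j = i + lag;
            if (j >= 0 && j < n) corr += (double)a[i] * b[j];
        }
        if (corr > best_corr) { best_corr = corr; best = lag; }
    }
    return best;
}

/* Convert the inter-microphone delay into a bearing (radians from
 * straight ahead) using the far-field approximation; the sign
 * convention depends on the microphone wiring. */
double sound_bearing(const float *left, const float *right, int n)
{
    int max_lag = (int)(MIC_SEP / C_SOUND * FS) + 1;
    int lag = best_lag(left, right, n, max_lag);
    double s = lag / FS * C_SOUND / MIC_SEP;   /* sin(bearing) */
    if (s > 1.0)  s = 1.0;
    if (s < -1.0) s = -1.0;
    return asin(s);
}
```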


Acknowledgements: This work is supported under a UK IED project funded in part by the Department of Trade and Industry, by the Esprit DIMUS project, The Turing Institute and BBN Systems and Technologies, UK.

References

[1] J. Y. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. In DARPA Image Understanding Workshop, pages 552-573, Los Angeles, CA, Feb 1987.

[2] P. Besl. Active, optical range imaging sensors. Machine Vision and Applications, 1:127-152, 1988.

[3] R. H. S. Carpenter. Movements of the eyes. Pion, London, 1977.

[4] S. Cochran and G. Medioni. Accurate surface description from binocular stereo. In DARPA Image Understanding Workshop, pages 857-869, Palo Alto, CA, May 1989.

[5] Z. Jin and P. Mowforth. A discrete approach to signal matching. Research Memo TIRM-89-036, The Turing Institute, Glasgow, UK, January 1989.

[6] B. Mysliwetz and E. D. Dickmanns. A vision system with active gaze control for real-time interpretation of well structured dynamic scenes. In L. O. Hertzberger and F. C. A. Groen, editors, Intelligent Autonomous Systems Conference, pages 477-483, Amsterdam, 1987. North Holland.

[7] T. J. Olson and R. D. Potter. Real-time vergence control. Research Report TR-264, University of Rochester, Computer Science Dept., Rochester, New York, 1988.

[8] P. Siebert and C. Urquhart. Active stereo: texture enhanced reconstruction. Electronics Letters, 26(7):427-430, March 1990.

[9] C. Urquhart. An investigation into active and passive methods for improving the performance of scale-space stereo. 5th year project report, BBN Systems and Technologies, Edinburgh, UK, March 1990.

[10] Y. F. Wang and J. K. Aggarwal. Integration of active and passive sensing techniques for representing 3D objects. IEEE Trans. Robotics and Automation, 5(4):460-470, 1989.