
J. of the Braz. Soc. Mechanical Sciences, Vol. XXIV, July 2002

A. M. Kabayama – alfred@mec.ita.cta.br

L. G. Trabasso Instituto Tecnológico de Aeronáutica

Divisão de Engenharia Mecânica-Aeronáutica, 12.228-900 São José dos Campos, SP, Brazil

[email protected]

Performance Evaluation of 3D Computer Vision Techniques

This work presents the implementation and comparison of three different three-dimensional computer vision techniques:
• Stereo vision - correlation between two 2D images;
• Sensorial fusion - use of different sensors: a 2D camera and an ultrasound (1D) sensor;
• Structured light.
The computer vision techniques presented herein were evaluated with respect to the following characteristics:
• Computational effort (elapsed time to obtain the 3D information);
• Influence of environmental conditions (noise due to non-uniform lighting, overlighting and shades);
• The cost of the infrastructure for each technique;
• Analysis of uncertainties, precision and accuracy.
The Matlab software, version 5.1, was chosen for the algorithm implementation of the three techniques because of the simplicity of its commands, programming and debugging. Besides, this software is well known and widely used by the academic community, allowing the results of this work to be reproduced and verified. Examples of three-dimensional vision applied to robotic assembly ("pick-and-place") tasks are presented.

Keywords: computer vision, range finders, robotics, mechatronics

Introduction

Because of the increasing use of robots in industry, robotics has become an engineering area with its own identity. The advance of sensor technologies and their decreasing prices have allowed the construction of robots with more feedback capabilities about their workspace. As a consequence, their positioning accuracy, operation speed and functional flexibility have increased. It is generally accepted that computer vision is the most powerful and flexible way to provide robots with feedback about the environment they interact with, and considerable effort and resources have been spent on its research and development, according to Ruocco (1987).

The determination of three-dimensional data from two-dimensional images is very important in the computer vision field, and one way to perform it is by using stereo vision techniques. When recovering a three-dimensional scene from a pair of stereo images, two main problems must be solved. The first, and more complex, is called 'stereo matching', that is, the establishment of a correlation between two images of the same object taken from two different points of view. A projected point in the first image must be matched to the point in the second image that is the projection of the same point of the three-dimensional world. The second problem is the geometric calculation of the three-dimensional co-ordinates from the pairs of correlated points in both images using triangulation techniques, as shown in Fu (1987). Most of the current stereo vision techniques use area-based algorithms. These algorithms split both images into a number of subregions, and a degree of photometric similarity is used to establish the matching between the respective subregions.
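For illustration only, the sketch below shows one simple area-based matching step in Python, using normalised cross-correlation between fixed-size blocks searched along the same image row. It is a generic example, not the relaxation-labelling matcher used in this work, and the block size and search range are arbitrary choices.

```python
# Sketch of area-based stereo matching: for a block around a pixel in the
# first image, search along the same row of the second image for the block
# with the highest normalised cross-correlation (NCC).
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two equally sized blocks."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_point(left, right, row, col, half=7, max_disp=40):
    """Return the disparity (pixels) and score of the best match along the row."""
    block = left[row - half:row + half + 1, col - half:col + half + 1]
    best_score, best_d = -1.0, 0
    for d in range(max_disp + 1):
        c = col - d                      # search towards smaller x in the second image
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1]
        score = ncc(block, cand)
        if score > best_score:
            best_score, best_d = score, d
    return best_d, best_score
```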

The sensorial fusion technique uses two types of sensors: a camera (two-dimensional sensor) and a range finder (one-dimensional sensor). This technique uses the computer vision capabilities to perform two-dimensional measurements such as the co-ordinates of an object's centre of area, as well as its length, width and orientation. The range finder then completes the missing information by measuring the object's height at its centre of area. The length, width, orientation and height information is used to calculate the attitude of a robotic manipulator and allow it to grab an object in a typical robotic pick-and-place task.

Presented at COBEM 1999 – 15th Brazilian Congress of Mechanical Engineering, 22-26 November 1999, São Paulo, SP, Brazil. Technical Editor: José Roberto F. Arruda.
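As a rough illustration of how these pieces combine, the following Python sketch estimates the centre of area and orientation of an object from a binary image mask (via image moments) and obtains the object height from the range-finder reading taken above that centre. The function names, the binary-mask input and the sensor mounting height are assumptions made for the example, not part of the original implementation.

```python
# Sketch of the sensorial fusion idea: 2D pose from image moments plus
# object height from the range finder aimed at the centre of area.
import numpy as np

def object_pose_2d(mask):
    """Centre of area (row, col) and orientation (rad) from a binary mask."""
    rows, cols = np.nonzero(mask)
    rc, cc = rows.mean(), cols.mean()
    mu20 = ((cols - cc) ** 2).mean()
    mu02 = ((rows - rc) ** 2).mean()
    mu11 = ((cols - cc) * (rows - rc)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return (rc, cc), theta

def object_height(sensor_to_table_mm, sensor_range_mm):
    """Height = known sensor-to-table distance minus measured range to the top."""
    return sensor_to_table_mm - sensor_range_mm
```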

Structured lighting is used mainly in inspection and quality control applications. This method is based on the extraction of geometric information encoded in a two-dimensional image. The object height information is extracted by measuring the deformation of a light pattern projected onto its surface. As the model of this deformation behaviour is known, it is possible to recover the object's height. The three three-dimensional computer vision techniques are described in detail in the following sections.

Stereo Vision

The stereo vision system presented herein was designed for a robotic 'pick-and-place' application, shown schematically in Fig. 1.

Figure 1. Stereo vision system scheme.

The stereo image grabbing process takes place in two steps, with the camera filming the scene from a top view. After the first image is grabbed, the camera is moved away (within a 0,5 to 1,5 cm range) through a displacement mechanism driven by a step motor. After this displacement, the second image is grabbed.

Stereo Vision Implementation

The development and tests of the algorithm have been carried out in four steps.

The first step was the development of the image correlation routines using MatLab version 5.1 and synthetic images


generated by 3D Studio, Version 4.0. At the beginning, the work was conducted with lower resolution images (200x300 pixels) to make the routine development process faster, given the huge computational effort involved in this technique. As soon as the correlation algorithm parameters were optimised and settled to provide good performance, the second step took place, in which real images, grabbed by the camera with 512x512 pixel and 64 grey-level resolution, replaced the synthetic ones.

The third step was the development of a process for calibrating the intrinsic camera lens parameters. The most important parameter is the focal distance, which is used to recover the object's height information.

The fourth step was the recovery of three-dimensional data about the scene from the 2D images. From the first image, which is always grabbed at an initial position, the information about the objects in the scene, such as length and width, is obtained in pixels. To recover metric information about the objects, it is necessary to find the relationship between the metric and pixel scales.

The information about object height is calculated through a simple triangulation technique based on the geometric optics model of the stereo image configuration, as shown in Fu (1987).

Image Correlation Process Using Relaxation Labelling

'Relaxation labelling processes are iterative procedures heuristically developed to solve certain constraint satisfaction problems, which have become a standard technique in computer vision and pattern recognition', according to Pelillo (1994). The search for the result is a coarse-to-fine process, which ends when the iterative procedure reaches the required error. These algorithms are associated with an 'energy' function, quantifying the degree of violation of the constraints, which is minimised as the process evolves. A generic sketch of one such update rule is given below.
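To make the iterative nature of the process concrete, here is a minimal, generic relaxation-labelling update in Python in the style described by Pelillo (1994), with label probabilities reinforced by compatibility-weighted support. It is not the Feris (1998) matcher; the compatibility tensor R and the stopping tolerance are assumed inputs.

```python
import numpy as np

def relaxation_step(P, R):
    """One update of the label probabilities P (N points x L labels).

    R[i, l, j, m] in [-1, 1] is the compatibility between assigning label l
    to point i and label m to point j.
    """
    q = np.einsum('iljm,jm->il', R, P)          # support received by (i, l)
    P_new = np.clip(P * (1.0 + q), 0.0, None)   # reinforce supported labels
    return P_new / (P_new.sum(axis=1, keepdims=True) + 1e-12)

def relax(P, R, iters=100, tol=1e-4):
    """Iterate until the change falls below the required error."""
    for _ in range(iters):
        P_next = relaxation_step(P, R)
        if np.abs(P_next - P).max() < tol:
            return P_next
        P = P_next
    return P
```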

The stereo matching algorithm implemented in this paper has been proposed by Feris (1998).

Camera Model and Camera Calibration

The camera model adopted is the 'pinhole' model, as shown in Fu (1987) and Nalwa (1993): the projection of a three-dimensional object onto a two-dimensional surface is achieved through straight lines that pass through a single point called the 'projection centre'. Camera calibration is the determination of all its inner geometric and optical features; these inner features are called intrinsic parameters. Camera calibration also means the determination of the camera's position and orientation relative to the world co-ordinate system; these features are called extrinsic parameters. Laudares (1997) presents in detail an extrinsic camera calibration process, which is quite suitable for the robotic application proposed in this work. The most important intrinsic camera parameter is the focal distance λ, which is the distance between the centre of the lens and the image sensor plane.
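A simple consequence of the pinhole model is that an object of known width at a known distance, spanning a measured number of pixels, allows λ to be estimated, provided the physical pixel pitch of the sensor is known. The Python sketch below illustrates this relation with made-up numbers; it is not the calibration procedure of Laudares (1997) nor the one used in this work.

```python
def focal_distance_m(Z_m, W_m, w_px, pixel_pitch_m):
    """Pinhole relation: image width = lambda * W / Z, therefore
    lambda = Z * (w_px * pixel_pitch) / W, all lengths in metres."""
    return Z_m * (w_px * pixel_pitch_m) / W_m

# Example (illustrative numbers): a 10 cm target seen 1 m away spans 400
# pixels of 12.5 um pitch, giving lambda = 0.05 m, i.e. a 50 mm lens.
lam = focal_distance_m(1.0, 0.10, 400, 12.5e-6)
```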

3D Object Height Recovery (Triangulation)

The following conditions must be met for the model shown in Fig. 2:

Figure 2. Stereo image geometric model.

• The cameras are identical;
• Both image co-ordinate systems are perfectly aligned, differing only in origin location;
• The Z co-ordinate is the same for both images.

According to Fu (1987), the depth information (Z co-ordinate) is recovered by the following expression:

Z = λ − λB / (x2 − x1)   (1)

where:

• λ is the focal distance, estimated by an experimental calibration process;
• x2 and x1 are co-ordinates in pixels, which must be converted to metric scale by multiplying them by a scale factor relating the sensor size in the x direction, in metres, to the sensor resolution, in pixels, in the same direction;
• B is the baseline, that is, the displacement between the two optical centres.
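A direct evaluation of Eq. (1) can be written as the short Python function below; the argument names and the zero-disparity guard are mine, and the pixel-to-metre scale factor is the one discussed above.

```python
def depth_from_disparity(lam_m, B_m, x1_px, x2_px, scale_m_per_px):
    """Z co-ordinate of a correlated point pair, Eq. (1):
    Z = lambda - lambda * B / (x2 - x1), with x1, x2 converted to metres."""
    x1 = x1_px * scale_m_per_px
    x2 = x2_px * scale_m_per_px
    disparity = x2 - x1
    if disparity == 0:
        raise ValueError("zero disparity: the point is at infinity")
    return lam_m - lam_m * B_m / disparity
```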

Some improvements to the Feris (1998) technique were included in order to increase the algorithm performance and ease the correlation process, as shown in Kabayama (1999). Further information about the focal distance and scale factor calibration procedures and results can also be found in Kabayama (1999).

Table 1 shows the results of height measurements of some objects using a 30 mm baseline displacement.

Table 1. Results of some object height measurements for a 30 mm baseline displacement (B = 30 mm ±0,5 mm).

Real height (cm) ±0,5 mm | Measured height (cm), correlation algorithm | Error (%) | Disparity (pixels) | Matches established
 4,90 |  5,2664 | 7,48 | 61,5325 | 77
 7,05 |  6,5745 | 6,74 | 62,8434 | 83
 9,87 | 10,2612 | 3,96 | 65,2917 | 48
11,35 | 11,4167 | 0,59 | 66,9365 | 63

Disparity is the difference between the respective x co-ordinates in both images, and the number of matches established is the number of correlated points.


Sensorial Fusion

The conception of the sensorial fusion technique for a 3D vision machine is shown in Fig. 3.

Figure 3. Computer vision system and ultrasound sensor in a sensorial fusion scheme.

The sensor used was Honeywell's 946-A4V-2D-2CO-175E ultrasound sensor, which has the following features:
• Minimum range: 200 mm (programmable);
• Maximum range: 2000 mm (programmable);
• Weight: 0.210 kg.
The sensor provides an analogue voltage output proportional to the distance to be measured. This proportionality can be direct or inverse, depending on how the sensor is programmed (rising or falling modes). The curves relating the sensor output voltage to the measured distance were determined using both proportional modes for different range programmes. The results showed that the ultrasound sensor has a linear behaviour in all modes, which is an important and desirable feature. The static calibration coefficients for each curve were calculated; they are necessary to establish the relationship between the output voltage and the measured distance, and to evaluate the sensitivity of the programmed mode to noise as well as its resolution. A sketch of such a straight-line calibration is given below.
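A minimal sketch of this static calibration, assuming a straight-line relationship between distance and output voltage, is shown below in Python. The readings are invented for illustration and are not the measured calibration data.

```python
import numpy as np

def calibrate(distances_mm, voltages_v):
    """Least-squares straight-line fit: voltage = k * distance + b."""
    k, b = np.polyfit(distances_mm, voltages_v, 1)
    return k, b          # sensitivity (V/mm) and offset (V)

def voltage_to_distance(v, k, b):
    """Invert the fitted line to convert a reading back to a distance (mm)."""
    return (v - b) / k

# Illustrative readings for a rising-mode programme between 200 and 1800 mm
d = np.array([200.0, 600.0, 1000.0, 1400.0, 1800.0])
v = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
k, b = calibrate(d, v)   # k = 0.005 V/mm, b = 0.0 V for these numbers
```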

As for the ultrasound beam characteristics, shown in Fig. 4, it was necessary to study the radial profile; the results are shown in Table 2.

Figure 4. Scheme used for ultrasound beam profile measurement process.

Table 2. Ultrasound beam profile diameter measurements at several levels, with a 2 m maximum range programmed.

Range (cm) ±0,5 mm | Ultrasound beam diameter (cm) ±0,5 cm
 20 |  3
 30 |  8
 40 | 16
 90 | 13
140 | 11
180 |  6

The distances shown in Table 2 refer to the object top. The determination of the ultrasound beam diameter at a given level was performed experimentally: a sheet of grid paper was fixed on the surface of the testing table and the sensor was positioned at a certain range from it, aimed in its direction. An object was moved on this surface towards the place the sensor was pointing at. As soon as the object was detected, the place where this happened was marked. This procedure was repeated until a complete ultrasound profile at that level was determined. The entire process was then repeated for the other levels shown in Table 2.

From the knowledge of the sensor features, it is possible to estimate the minimum size of an object that can be manipulated using this technique. For example, at a 40 cm range the object must be at least 16 cm in size: the object cannot be smaller than the diameter of the ultrasound beam at the object top. Besides, the object material should not absorb the ultrasound waves, and the object's top must be perpendicular to the direction from which the ultrasound beam reaches it. A small interpolation check of this constraint is sketched below.
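As a rough aid, the measured profile in Table 2 can be interpolated to estimate the beam diameter, and hence the minimum manageable object size, at intermediate ranges. The linear interpolation below is my own simplification, not part of the original procedure.

```python
import numpy as np

# Beam diameters measured at several ranges (Table 2)
ranges_cm = np.array([20.0, 30.0, 40.0, 90.0, 140.0, 180.0])
beam_diameter_cm = np.array([3.0, 8.0, 16.0, 13.0, 11.0, 6.0])

def min_object_size_cm(range_to_object_top_cm):
    """Estimated beam diameter (cm) at the object top, i.e. the smallest
    object size that can be handled at that range."""
    return float(np.interp(range_to_object_top_cm, ranges_cm, beam_diameter_cm))

print(min_object_size_cm(40.0))   # 16.0 cm, the example quoted in the text
```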

Structured Lighting

Two different lighting patterns were studied to evaluate the accuracy and to check whether this technique is suitable for a pick-and-place robotic application.

The first pattern studied was a laser light source from a presentation pointer device. An external DC power source was fitted to the device in order to avoid the decrease in light intensity that occurs as the batteries go flat. Fig. 5 shows a general view of the experiment using the laser light source and the respective model [Galbiati (1990)] used to recover the object height h.

Figure 5. Structured lighting experiment scheme using laser light as pattern and object’s height model.

The full line and the dotted line shown in Fig. 5 represent two distinct situations in the data acquisition process for recovering the object height h.

The scene is filmed twice. In the first shot, represented by the dotted line, the object is out of the scene and P1 is the position of the laser beam centre of area where it touches the ground. In the second shot, represented by the full line, the object is in the scene and P2 is the position of the laser beam centre of area where it touches the top of


the object. The laser beam centre of area is determined by the computer vision system in both situations. P3 is the projection of P2 onto the horizontal plane. The laser beam reaches the ground at an angle θ, and d is the distance, in pixels, between the centres of area P1 and P2. The object height h is determined by a simple trigonometric relation (see Fig. 5):

h = d . s . tg θ   (2)

where:
• d ∈ Z+ is the distance, in pixels, defined above;
• s ∈ R+ is an experimental conversion factor between centimetres and pixels;
• θ ∈ R+ is the angle between the P1P2 and P1P3 line segments.
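Eq. (2) translates directly into the small Python function below; the example call uses the conversion factor and the average angle obtained in the calibrations described next, and the 100-pixel displacement is an arbitrary illustrative value.

```python
import math

def height_cm(d_px, s_cm_per_px, theta_deg):
    """Object height from Eq. (2): h = d * s * tg(theta)."""
    return d_px * s_cm_per_px * math.tan(math.radians(theta_deg))

# Example with s = 21.1/387 cm/pixel and the average theta from Table 3
h = height_cm(100, 21.1 / 387, 45.0188)   # about 5.46 cm for a 100-pixel shift
```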

Implementation

The first step was the determination of the conversion factor s. An A4 sheet of paper was used as a standard object and shot five times. Then, the object size in pixels was measured in each image. The conversion factor s is the average ratio between the standard object size measured in centimetres and in pixels. It is important to note that this conversion factor must be determined along the same direction as the distance variation, because the camera pixels are not square.

The s factor determined was: s = 21.1 / 387 (cm/pixel).

The second step was the θ angle calibration procedure. Five different objects with known heights h were used as calibration standards and shot five times each. Then, using Eq. (3), the respective angles θ were calculated for each measured distance d.

θ = arctg ( h / (d . s) )   (3)
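The two calibration steps can be sketched in Python as below: s from a standard object of known size, and θ from Eq. (3) averaged over standards of known height. The pixel counts in the example are invented; only the 21.1 cm standard size and the structure of the procedure come from the text.

```python
import numpy as np

def conversion_factor(size_cm, sizes_px):
    """s = known object size (cm) over the mean measured size (pixels)."""
    return size_cm / np.mean(sizes_px)

def calibrate_theta_deg(heights_cm, distances_px, s_cm_per_px):
    """Eq. (3): theta = arctg(h / (d * s)), averaged over the standards."""
    thetas = np.degrees(np.arctan2(np.asarray(heights_cm),
                                   np.asarray(distances_px) * s_cm_per_px))
    return thetas.mean(), thetas.std(ddof=1)

# s from the 21.1 cm standard (the pixel counts here are made up)
s = conversion_factor(21.1, [387, 386, 388, 387, 387])
```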

The θ angle calibration results are shown in Table 3.

Table 3. Results of the θ calibration process using the laser light source as structured light pattern.

Height h (cm) ±0,5 mm | Average of 5 angles (degrees)
 1,5 | 44,9280
 4,5 | 45,2576
 6,9 | 46,3332
11,7 | 45,6472
14,7 | 42,9280

Average angle: 45,0188
Standard deviation: 1,2805

After completing the s and θ calibration processes, the 3D structured lighting computer vision system using the laser light source was ready to be operated. Different object heights were measured and the results are shown in Table 4.

Table 4. Object measurements after the s and θ calibration, using the laser light source as structured light pattern.

Actual height h (cm) ±0,5 mm | Measured height (cm), average of 5 measurements | Error (%)
 6,1 |  5,9795 | 1,97
 9,3 |  9,3187 | 0,20
15,9 | 16,7929 | 5,62

The second pattern used in this experiment was a single stripe projected by an overhead projector. Fig. 6 shows a general view of the experiment and Fig. 7 shows the projected pattern in detail.

Figure 6. General view of experiment.

Figure 7. Projected pattern detail.

In this case, the object height information recovery is similar to the laser case, using the same principles and equations. The difference is that applying the stripe pattern results in the computer vision system recognising three objects, as shown in Fig. 7. Due to digitalisation errors, the alignment of objects O2 and O3 cannot always be obtained. Because of this, the distance d is taken as the average of the distance between O2 and O1 and the distance between O3 and O1, as sketched below.
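A short sketch of that averaging step, combined with Eq. (2), is given below in Python. The centre-of-area coordinates are hypothetical (row, column) pixel pairs, and the distance is taken as the Euclidean distance between centres, which is one reasonable reading of the text.

```python
import math

def mean_distance_px(o1, o2, o3):
    """d = average of the O1-O2 and O1-O3 centre-of-area distances (pixels)."""
    return 0.5 * (math.dist(o1, o2) + math.dist(o1, o3))

def stripe_height_cm(o1, o2, o3, s_cm_per_px, theta_deg):
    """Object height via Eq. (2) using the averaged distance d."""
    d = mean_distance_px(o1, o2, o3)
    return d * s_cm_per_px * math.tan(math.radians(theta_deg))
```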

The conversion factor s used is the same as in the previous experiment. The θ angle calibration process was repeated using five standard objects. The results are shown in Table 5.

Table 5. Results of the θ calibration process using a single stripe as structured light pattern.

Average angle (from 5 measurements per standard): 38,0321 degrees
Standard deviation: 1,5887 degrees

After completing the s and θ calibration processes, the 3D structured lighting computer vision system using the single stripe pattern was ready to use. Different object heights were measured using Eq. (2). Some results are shown in Table 6.

Table 6. Object measurements after the s and θ calibration, using a single stripe as structured light pattern.

Actual height h (cm) ±0,5 mm | Measured height (cm), average of 5 measurements | Error (%)
 6,2 |  6,0345 | 2,67
 7,8 |  7,9260 | 1,61
11,2 | 12,0865 | 7,91


Analysis and Conclusions

Computer vision is a field with great potential to be exploited, and this work shows that there are still many subjects to be tested and improved. Despite all the literature available about the techniques discussed in this work, only practical work exposes the researcher to the difficulties involved in their implementation and allows an evaluation of which assumptions and measurements can be made.

The errors obtained in the object measurements in the stereo vision and structured lighting experiments are acceptable for typical pick-and-place applications, owing to the compliance compensation of the robot end effector, even considering the reported worst case of a 1 cm error.

Table 7 presents the analysis and conclusions compiled from the experimental data and from the difficulties faced during the implementation of each technique.

Table 7. Comparative table including parts costs, computational effort, environmental influence and system improvement suggestions.

Common parts costs (all techniques): PC microcomputer: US$ 1.000,00; computer vision system (camera, frame grabber, software, lenses and monitor): US$ 2.000,00.

Technique: Camera (2D) + range finder (1D)
• Specific parts costs: ultrasound sensor: US$ 800,00; A/D converter board: US$ 1.200,00.
• Computational effort: the lowest of the three techniques.
• Environment influence: external environment lighting interference, which is common to all techniques.
• System improvement suggestion: use a "back lighting" illumination system (US$ 150,00).

Technique: Structured light
• Specific parts costs: overhead projector: US$ 400,00; laser pointer: US$ 10,00.
• Computational effort: low; the result is obtained in less than one second after image capture.
• Environment influence: lack of contrast between the structured light pattern, the object and the background.
• System improvement suggestion: environmental lighting insulation (US$ 130,00).

Technique: Stereo vision
• Specific parts costs: camera displacement mechanism: US$ 500,00.
• Computational effort: high; measurements were obtained in 40 seconds (best case) to 4 minutes (worst case).
• Environment influence: the illumination system cannot produce overlighting or shades that generate ambiguous situations or mislead the correlation algorithm.
• System improvement suggestion: algorithm optimisation and implementation in a compiled language like C.

References

Feris, R.S. & Lages, W.F., 1998, Casamento de Imagens Utilizando Correlação e Rotulação Relaxada, Anais do IV ENCITA - Encontro de Iniciação Científica e Pós-Graduação do ITA.

Fu, K.S., Gonzalez, R.C. & Lee, C.S.G., 1987, Robotics, McGraw-Hill.

Galbiati, L.J., 1990, Machine Vision and Digital Image Processing Fundamentals, Prentice Hall.

Kabayama, A.M., 1999, Implementação e Análise de Técnicas de Fusão Sensorial Aplicadas à Robótica, Tese de Mestrado, ITA.

Laudares, D., 1997, Procedimento Automático para Calibração de Sistemas de Visão Robótica para Operações Pick-and-Place, Tese de Mestrado, ITA.

Nalwa, V.S., 1993, A Guided Tour of Computer Vision, Addison-Wesley Publishing Company.

Pelillo, M., 1994, On Dynamics of Relaxation Labeling Processes, IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ruocco, S.R., 1987, Robot Sensors and Transducers, Open University Press Robotics Series, edited by P.G. Davey.