An Efficient Hierarchical Matching
Algorithm for Processing Uncalibrated Stereo
Vision Images and its Hardware Architecture

Lazaros Nalpantidis1, Angelos Amanatiadis2, Georgios Sirakoulis2 and Antonios Gasteratos1
1Laboratory of Robotics and Automation, Department of Production and Management Engineering,
Democritus University of Thrace, University Campus Kimmeria, GR-67100, Xanthi, Greece
Email: {lanalpa, agaster}@pme.duth.gr
2Laboratory of Electronics, Department of Electrical and Computer Engineering,
Democritus University of Thrace, 12 Vas. Sofias Str., GR-67100, Xanthi, Greece
Email: {aamanat, gsirak}@ee.duth.gr
Abstract
In motion estimation, the sub-pixel matching technique involves the search of sub-sample positions as well as
integer-sample positions between the image pairs, choosing the one that gives the best match. Based on this idea,
this work proposes an estimation algorithm, which performs a 2-D correspondence search using a hierarchical
search pattern. The intermediate results are refined by 3-D Cellular Automata (CA). The disparity value is then
defined using the distance of the matching position. Therefore, the proposed algorithm can process uncalibrated,
non-rectified stereo image pairs, maintaining the computational load within reasonable levels. Additionally, a
hardware architecture of the algorithm is deployed. Its performance has been evaluated on both synthetic and
real self-captured image sets. These attributes make the proposed method suitable for autonomous outdoor robotic
applications.
Index Terms
Disparity estimation, hierarchical matching, uncalibrated stereo vision, outdoor robotics.
I. INTRODUCTION
Outdoor robotics places strict constraints on the algorithms used [1], [2]. The lighting conditions in outdoor
environments are far from being ideal. A stereo camera, which acquires two displaced views of the same
scene, is very sensitive to such conditions [3], [4]. Moreover, the rough terrain and the bounces it causes to a
moving robot often decalibrate the cameras of a stereo acquisition array. Autonomous operation demands high
processing frame-rates. On the other hand, a robotic platform can provide limited computational resources,
power and payload capacity for many different onboard applications. These facts differentiate the priorities of
stereo vision algorithms intended for use in outdoor operating robots from those listed in [5]. The algorithms
listed in the aforementioned site compete based on their accuracy on four perfectly lighted, calibrated and
June 20, 2010 DRAFT
rectified image sets without any timing or computational constraints. However, this is not the case in outdoor
environments, and consequently a comparison of the proposed algorithm with those would be pointless.
This work presents a novel stereo vision algorithm, inspired by recent motion estimation techniques. The
proposed algorithm has been adapted to the demands of contemporary outdoor robotic applications. It is
based on a rapidly executed Sum of Absolute Differences (SAD) core for correspondence search in both directions
of the input images. The results of this core are enhanced using more sophisticated computational techniques:
Gaussian-weighted aggregation and 3-D CA rules [6]. The hierarchical iteration of the basic stereo algorithm
was achieved using a fuzzy scaling technique [7]. The aforementioned characteristics provide improved quality
of results while remaining easy to implement in hardware. As a result, the presented algorithm is
able to cope with uncalibrated input images. On the other hand, the hardware implementation of the algorithms
already proposed in the literature is not always straightforward [8]. Nevertheless, the need for hardware
implementations of efficient and robust stereo algorithms able to provide real-time frame rates, especially in the
case of moving robots, is pressing. The appeal of hardware implementations is that they easily outperform the
algorithms executed on a computer. The achieved frame-rates are generally higher. The power consumed by a
dedicated hardware platform, e.g. ASIC or FPGA, is considerably lower than that of a common microprocessor.
Moreover, the computational power of the robot's available onboard PCs is left intact. As a result, in this work
we also propose an efficient hardware implementation, based on the proposed parallel techniques, which can
serve the aforementioned needs and provide both qualitative results and real-time frame rates at the same time.
Additionally, the perception of objects' depth in a scene is crucial for autonomous robotic applications. The
accuracy of the depth estimation results and their rapid refresh rate, as a robot moves, are the cornerstone of
many higher-level behaviors. Since our proposed scheme is block-matching based and does not perform scanline
pixel matching, it requires neither camera calibration nor image rectification. However, it is clear that block
matching approaches require more computational resources since the number of pixels to be considered is
greatly increased. In order to address this problem, the proposed algorithm is a variation of a motion estimation
algorithm [9] which is used for JVT/H.264 video coding [10]. The adaptation of compression motion estimation
algorithms into disparity estimation schemes can be effective both in accuracy and complexity terms, since
compression algorithms also attempt to achieve complexity reduction while maintaining coding efficiency. On
the other hand, CA have been employed as an intelligent and efficient way to refine and enhance the stereo
algorithm’s intermediate results.
The rest of this paper is organized as follows. For readability reasons, a brief introduction to stereo vision
is presented in section II. The related work is reviewed in section III, while the proposed disparity estimation
algorithm is described in
full detail in section IV. The corresponding hardware architecture of the proposed algorithm is discussed in
section V, and the experimental results are shown in section VI. Finally, the conclusions are drawn in
section VII.
II. STEREO VISION
The estimation of the disparity between two images of the same scene is a long-standing issue for the
machine vision community [11]. Stereoscopic vision is based on the principle, first utilized by nature, that two
spatially differentiated views of the same scene provide enough information so as to perceive the depth of the
Fig. 1. Geometry of epipolar lines, where C1 and C2 are the left and right camera lens centers, respectively. (a) Point P1 in one image
plane may have arisen from any of the points in the line C1P1, and may appear in the alternate image plane at any point on the epipolar
line E2; (b) In the case of non-rectified images, point P1 may have arisen for any of the points inside the block B.
portrayed objects. Thus, the importance of stereo correspondence is apparent in the fields of machine vision,
virtual reality, robot navigation, simultaneous localization and mapping [12], [13], depth measurements and 3-D
environment reconstruction [14].
The two alternatives for estimating disparity are either to precisely align the stereo camera rig and then
perform the demanded rectification (leading to simple scanline searches), or to have an arbitrary stereo camera
setup and avoid any rectification (performing searches over 2-D blocks). Accurately aligned stereo devices
are very expensive, as they demand calibration of a series of factors in micrometer scale [15]. On the other hand,
non-ideal stereo configurations usually produce inferior results, as they fail to satisfy the epipolar constraint.
To reduce disparity estimation complexity, the used cameras are usually arranged in a parallel-axis configura-
tion or equivalently, the stereo image pairs are carefully rectified using the camera geometry. The main function
of a stereo correspondence algorithm is to match each (in the case of dense stereo) or some (in the case of
sparse stereo) pixels of the first image to their corresponding ones in the second image. The outcome of this
process is a depth image, i.e. a disparity map. This matching can be done as a 1-D search if the stereo pairs
are accurately rectified. In rectified pairs horizontal scan lines reside on the same epipolar lines, as shown in
Fig. 1(a).
A point P1 in one image plane may have arisen from any of the points on the line C1P1, and may appear in the
alternate image plane at any point on the so-called epipolar line E2. Thus, the search is theoretically reduced
to a scan line, since corresponding pair points reside on the same epipolar line. The difference between the two
horizontal coordinates of these points is the disparity. The disparity map results from assigning to each matched
pixel the disparity value of its correspondent. Afterwards, the depth of the scene can be derived from the
disparity map, since the closer an object is, the larger the disparity value it is expected to have.
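This inverse relation between distance and disparity follows from the standard parallel-axis triangulation formula Z = f·B/d, which the text does not state explicitly; the sketch below assumes it, and the focal length and baseline values in the usage example are purely illustrative:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Parallel-axis stereo triangulation: depth Z = f * B / d.
    Depth is inversely proportional to disparity, so closer
    objects produce larger disparity values."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

For example, with a (hypothetical) focal length of 800 px and a 0.12 m baseline, halving the disparity from 32 px to 16 px doubles the estimated depth from 3 m to 6 m.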
However, this approach makes disparity algorithms restrictive in application, and even a small parallel-axis
deviation introduces great disparity errors. On the other hand, the more general case of non-rectified image
pairs, shown in Fig. 1(b), leads to a sharp increase in the demanded calculations. In this case, all the points
inside a 2-D area have to be considered as possible matches, in contrast to the previously mentioned 1-D search.
III. RELATED WORK
Autonomous robots’ behavior greatly depends on the accuracy of their decision making algorithms [16]. In
the case of stereo vision-based navigation, the accuracy and the refresh rate of the computed disparity maps are
the cornerstone of its success [17]. Dense local stereo correspondence methods calculate depth for almost every
pixel of the scenery, taking into consideration only a small neighborhood of pixels each time [18]. On the other
hand, global methods are significantly more accurate but at the same time more computationally demanding,
as they account for the whole image [19]. However, since the most urgent constraint in autonomous robotics
is the real-time operation, such applications usually utilize local algorithms.
In the relevant literature, many stereo vision based methods have been proposed for outdoor navigation. Much
attention has been given to the issue by the automotive industry [20], [2] and enhancement methods have been
proposed [21]. Many advanced behaviors base their success upon the accuracy of the stereo vision based depth
calculation [22], [1], [23]. These works employ a fast SAD-based algorithm to provide the necessary depth
maps. The importance of reliable and noise-free disparity maps is evident. On the other hand, common issues
related to outdoor exploration, such as possible decalibration of the stereo system and tolerance to non-perfect
lighting conditions, are not addressed. Furthermore, to the best of our knowledge the majority of published
algorithms used for outdoor robot navigation have not been implemented on hardware. Unfortunately, there is
no commonly agreed test-bench for outdoor-focused real-time stereo algorithms, analogous to [5].
The issue of processing non-rectified images is common to applications where the sensory system is not
explicitly specified. The plethora of computations most commonly requires the massive parallelization found in
custom tailored hardware implementations. The contemporary powerful graphics machines are able to achieve
enhanced results in terms of processing time and data volume. A hierarchical disparity estimation algorithm
implemented on a programmable 3-D graphics processing unit is reported in [24]. This method can process either
rectified or non-rectified image pairs. Bidirectional matching is utilized in conjunction with a locally aggregated
sum of absolute intensity differences. This implementation, on an ATI Radeon 9700 Pro, can achieve up to 50 fps
for 256× 256 pixel input images. The FPGA implementation presented in [25] uses the dynamic programming
search method on a Trellis solution space. It copes with the vergent cameras case, i.e. cameras with optical
axes that intersect arbitrarily, producing non-rectified stereo pairs. The image pairs received from the cameras
are initially rectified using linear interpolation and then, during a second step, the disparity is calculated. The
architecture has the form of a linear systolic array using simple processing elements. The design is canonical
and simple to implement in parallel. The implementation requires 208 processing elements. The resulting
system can process 1280 × 1000 pixel images with up to 208 disparity levels at 15 fps. An extension of the
previous method is presented in [26]. The main difference is that data from previously processed lines are
incorporated so as to enforce better inter-scanline consistency. The running speed is 30 fps for 320 × 240
pixel images with 128 disparity levels. The number of utilized processing elements is 128. The percentage of
pixels with disparity error larger than 1 in the non-occluded areas is 2.63, 0.91, 3.44, and 1.88 for the Tsukuba,
Map, Venus and Sawtooth image sets, respectively. Finally, [27] proposes the utilization of the local weighted
phase-correlation method. The platform used is the Transmogrifier-4, containing four Altera Stratix S80 FPGAs.
The system performs rectification and left-right consistency check to improve the accuracy of the results. The
Fig. 2. Block diagram of the proposed stereo correspondence algorithm: the left and right images feed an absolute differences calculation block, followed by the matching cost aggregation stage (Gaussian weighted aggregation and CA refinement) and a disparity selection block that outputs the disparity map.
speed for 640 × 480 pixel images with 128 disparity levels is 30 fps.
A detailed taxonomy and presentation of dense stereo correspondence algorithms can be found in [18].
Additionally, the recent advances in the field as well as the aspect of hardware implementable stereo algorithms
are covered in [8].
IV. PROPOSED DISPARITY ESTIMATION ALGORITHM
A. Stereo Correspondence Algorithm
The proposed system utilizes a simple, rapidly executed stereo correspondence algorithm applied to each
stereo pair. Stereo disparity is computed using a typical three-stage local stereo correspondence algorithm
[6]. The main merit of the used algorithm is its low computational complexity. It is essentially a simple
SAD algorithm, with an aggregation step enhanced by CA rules. The algorithm is focused on achieving
high processing frame-rates. The structural elements of this algorithm are presented in Fig. 2. The stereo
correspondence algorithm deliberately involves a non-iterative disparity selection step. Moreover, the frequently
utilized additional final refinement step, i.e. filtering or interpolation, is absent for speed reasons.
The matching cost function utilized is the Absolute Differences (AD) which is inherently the simplest metric
of all, involving only summations and finding absolute values.
cost(x, y, d) = |Ileft(x, y) − Iright((x − d), y)| (1)
where Ileft and Iright denote the intensity values for the left and right image respectively, d is the value of the
disparity under examination, ranging from 0 to D − 1, and x, y are the coordinates of the pixel on the i, j plane.
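A minimal sketch of this cost computation over all candidate disparities, producing the 3-D cost volume described next, might look as follows (function and variable names are our own; grayscale NumPy arrays are assumed):

```python
import numpy as np

def ad_cost_volume(left, right, max_disp):
    """Compute cost(x, y, d) = |I_left(x, y) - I_right(x - d, y)|
    for every pixel and every candidate disparity d in [0, D-1].
    Positions where x - d falls outside the image get infinite cost."""
    h, w = left.shape
    costs = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # Shifted subtraction: left column x is compared to right column x - d.
        costs[:, d:, d] = np.abs(left[:, d:].astype(float)
                                 - right[:, :w - d].astype(float))
    return costs
```

The loop runs once per disparity level, so the whole volume is built with D vectorized array subtractions rather than per-pixel loops.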
Disparity Space Image (DSI) is a 3-D matrix, i.e. the i× j ×D volume, containing the computed matching
costs for every pixel and for all its potential disparity values, as shown in Fig. 3(a). The DSI values for
constant disparity value are aggregated inside fixed-size square windows. The dimensions (2w + 1)× (2w + 1)
of the chosen aggregation window play an important role in the quality of the final result. Generally, small
dimensions preserve details but suffer from noise. On the contrary, large dimensions may not preserve fine
details but significantly suppress the noise. However, the choice of the aggregation window’s dimensions has
to take into account the size of the processed input images. That is, the dimensions have to scale according to
the image size in order to obtain the same amount of detail and noise suppression.
The AD aggregation step of the proposed algorithm is a weighted summation. Each pixel is assigned a weight
depending on its Euclidean distance from the central pixel. A 2-D Gaussian function determines the weight
value for each pixel. The center of the function coincides with the central pixel. The standard deviation is
equal to the distance from the central pixel to the nearest window-border. The applied weighting mask can be
(a) DSI containing matching costs for
every combination of x, y, d.
(b) A 3×3 neighborhood
inside the DSI.
Fig. 3. Views of the Disparity Space Image (DSI).
calculated once and then be applied to all the aggregation windows without any further change. The weighted
summation can be also considered as a convolution between the cost function values and a weighting mask
matrix. Thus, the computational load of this procedure is kept within reasonable limits.
aggr(x, y, d) = Σ_{i=−w}^{w} Σ_{j=−w}^{w} mask(i, j) · cost((x + i), (y + j), d) (2)

where the pixel ranges [−w, w] define the weighted aggregation window.
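The mask construction and the weighted aggregation of the equation above can be sketched as follows (a direct, unoptimized double loop; the hardware described later uses a convolution circuit instead, and all names here are our own):

```python
import numpy as np

def gaussian_mask(w):
    """(2w+1) x (2w+1) Gaussian weighting mask centred on the
    central pixel; the standard deviation equals w, the distance
    from the centre to the nearest window border."""
    i = np.arange(-w, w + 1)
    xx, yy = np.meshgrid(i, i)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * w ** 2))

def aggregate_slice(cost, mask):
    """Weighted aggregation of one constant-disparity DSI slice:
    each output pixel is the mask-weighted sum of the costs in its
    (2w+1) x (2w+1) neighbourhood (edge-replicated borders)."""
    w = mask.shape[0] // 2
    padded = np.pad(cost, w, mode='edge')
    h, wd = cost.shape
    out = np.empty((h, wd))
    for y in range(h):
        for x in range(wd):
            out[y, x] = np.sum(mask * padded[y:y + 2 * w + 1,
                                             x:x + 2 * w + 1])
    return out
```

As the text notes, the mask is computed once and reused for every window; only the windowed sums change per pixel.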
Afterwards, CA rules are used in order to further refine the procedure's results. CA were first introduced by
von Neumann [28], who was thinking of imitating the behavior of a human brain in order to build a machine
able to solve very complex problems [28]. His ambitious project was to show that complex phenomena can,
in principle, be reduced to the dynamics of many identical, very simple primitives, capable of interacting and
maintaining their identity [29]. Following a suggestion by [30], von Neumann adopted a fully discrete approach,
in which space, time and even the dynamical variables were defined to be discrete. Consequently, CA comprise
a very effective computational tool in simulating physical systems and solving scientific problems, because they
can capture the essential features of systems where global behavior arises from the collective effect of simple
components which interact locally [31], [32]. In CA analysis, physical processes and systems are described by
a cell array and a local rule, which defines the new state of a cell depending on the states of its neighbors. All
cells can work in parallel, since each cell can independently update its own state. Therefore, the proposed
CA model/algorithm is massively parallel and is an ideal candidate to be implemented in hardware
[33], [34], [35].
Two CA transition rules are applied to each 3 × 3 pixel neighborhood inside the DSI, as shown in Fig. 3(b).
The values of the parameters used by them were determined, after extensive testing, so as to perform best. The first rule
attempts to resolve disparity ambiguities. It checks for excessive consistency of results along the disparity (d)
axis and, if necessary, corrects on the perpendicular (i, j) plane. The second rule is placed in order to smoothen
the results and at the same time to preserve the details. It checks and acts on constant-disparity planes.
1) if at least one of the two pixels lying on either side of a pixel along the disparity axis (d) differs
from the central pixel by less than half of its value, then its value is further aggregated within its 3×3 pixel,
constant-disparity neighborhood.
First CA rule Pseudocode
if {
  |aggr(x,y,d) - aggr(x,y,d-1)|
  < (1/2)aggr(x,y,d) }
or {
  |aggr(x,y,d) - aggr(x,y,d+1)|
  < (1/2)aggr(x,y,d) }
then {
  aggr(x,y,d) = (1/9) *
    sum(i = -1,0,1) sum(j = -1,0,1)
      aggr(x+i, y+j, d) }
2) if there are at least 7 pixels in the 3× 3 pixel neighborhood which differ from the central pixel by less than
half of the central pixel's value, then the central pixel's value is scaled down by a factor of 1.3, as dictated
by exhaustive testing.
Second CA rule Pseudocode
count = 0
for i,j = (-1,0,1) {
  if (i,j) <> (0,0) {
    if {
      |aggr(x+i,y+j,d) - aggr(x,y,d)|
      < (1/2)aggr(x,y,d) }
    then {
      count++ }}}
if { count >= 7 }
then {
  aggr(x,y,d) = (1/1.3)aggr(x,y,d) }
The two rules are applied once. Their outcome comprises the enhanced DSI, from which the optimum disparity
map is chosen by a simple, non-iterative Winner-Takes-All (WTA) final step.
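A direct, unoptimized reading of the two rules, applied once to the interior cells of the aggregated DSI, might look as follows; border handling and the exact read/write ordering are our assumptions, since the text leaves them unspecified:

```python
import numpy as np

def ca_refine(aggr):
    """Apply the two CA transition rules once to a DSI of shape
    (h, w, D); border cells are left unchanged in this sketch."""
    out = aggr.copy()
    h, w, D = aggr.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            for d in range(1, D - 1):
                c = aggr[y, x, d]
                # Rule 1: if a neighbour along the disparity axis is
                # closer than half the centre value, re-aggregate the
                # centre within its 3x3 constant-disparity window.
                if (abs(c - aggr[y, x, d - 1]) < 0.5 * c or
                        abs(c - aggr[y, x, d + 1]) < 0.5 * c):
                    out[y, x, d] = aggr[y - 1:y + 2, x - 1:x + 2, d].mean()
                # Rule 2: if at least 7 of the 8 on-plane neighbours
                # differ from the centre by less than half its value,
                # scale the (possibly updated) centre down by 1.3.
                neigh = aggr[y - 1:y + 2, x - 1:x + 2, d]
                close = np.abs(neigh - c) < 0.5 * c
                if close.sum() - close[1, 1] >= 7:
                    out[y, x, d] /= 1.3
    return out
```

Since every cell reads only the original `aggr` values, all cells could be updated in parallel, which is exactly the property the hardware implementation exploits.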
In the last stage the best disparity value for each pixel is decided by a WTA selection procedure. For each
image pixel with coordinates (i, j), the smallest value along the d axis is searched for, and its position is
declared to be the pixel's disparity value. That is:
disp(x, y) = arg min_d (aggr(x, y, d)) (3)
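In array form, this WTA step is a single argmin along the disparity axis of the refined DSI (a sketch with our own naming):

```python
import numpy as np

def wta_disparity(aggr):
    """Winner-Takes-All selection: for every pixel (x, y), pick the
    disparity index with the minimum aggregated cost along the d axis."""
    return np.argmin(aggr, axis=2)
```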
B. Hierarchical Matching Technique
The most computationally intensive process for a dense disparity algorithm is the accurate pixel matching for
each pair of images. In this section we will describe the adapted motion estimation scheme that we used in our
implementation for disparity estimation, which was first proposed in [9] for JVT/H.264 video coding [10].
Fig. 4. Quadruple, double and single pixel sample matching algorithm.
Fig. 5. General scheme of the proposed hierarchical matching disparity algorithm. The search block is enlarged for viewing purposes.
Let the maximum expected horizontal disparity for a stereo image pair be D. The dimensions of the stereo
pixel matching search block are D×D. For each search block, the disparity value is determined by the horizontal
distance of the (single pixel sampling) best match in terms of minimum sum of absolute differences, as shown
in Fig. 4. In the first stage, the disparity algorithm finds the best match on the quadruple sample grid (circles).
Then, the algorithm searches the double pixel positions next to this best match (squares) to assess whether the
match can be improved and if so, the single pixel positions next to the best double pixel position (triangles) are
then explored. The general scheme of the proposed hierarchical matching disparity algorithm between a stereo
image pair is shown in Fig. 5. Each of the intermediate disparity maps of the first two steps is used as the
initial condition for the succeeding, refining correspondence search.
In order to perform the hierarchical disparity search three different versions of the input images are employed
and the stereo correspondence algorithm is applied to each of these three pairs. The quadruple search step is
performed as a normal pixel-by-pixel search, on a quarter-size version of the input images. That is, each of the
initial images has been down-sampled to 25% of their initial dimensions. The quadruple search is performed
by applying the stereo correspondence algorithm in (D/4) × (D/4) search regions, on the down-sized image
pair (D being the maximum expected horizontal disparity value in the original image pair). The choice of the
maximum searched disparity D/4 is reasonable as the search is performed on a 1/4 version of the original
images.
The window value 2w + 1 used in this stage is 9, i.e. w = 4. Once the best match is obtained for each
pixel, another correspondence search is performed in 3 × 3 search regions, on a half-size version of the initial
Fig. 6. Block diagram of the hierarchical disparity search algorithm: the left and right images are downsized to 25% and 50%; the stereo correspondence algorithm computes the quadruple search disparity map, which is upscaled by a factor of 2 to initialize the double search; that result is in turn upscaled by a factor of 2 to initialize the final, full-resolution search producing the final disparity map.
image pair. Thus, the double pixel search is performed on a 50% down-sampled version of the input images
with window dimension 2w + 1 being 15, i.e. w = 7. Finally, the single pixel matching is performed in 3 × 3
regions, on the original input pair. The window value 2w + 1 used in this final stage is 23, i.e. w = 11. The
block diagram of the presented algorithm is shown in Fig. 6. The choice of 3×3 search regions for the last two
steps of the hierarchical pattern can be explained as follows. The first stage is expected to find the best match
for each pixel. As the next stage uses another version of the same image with double dimensions, the initially
matched pixel could have been mapped to any pixel of the corresponding 3 × 3 neighborhood in the bigger
version of the image.
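The three-level pattern can be sketched as follows, in a deliberately simplified form: horizontal disparity only, single-pixel costs instead of the full aggregation/CA pipeline, and naive decimation instead of the fuzzy scaling; all names are our own:

```python
import numpy as np

def cost(left, right, x, y, d):
    # Single-pixel absolute difference (windowed SAD omitted for brevity).
    if 0 <= x - d < right.shape[1]:
        return abs(float(left[y, x]) - float(right[y, x - d]))
    return np.inf

def search(left, right, candidates):
    """Per-pixel search over candidates(y, x), an iterable of
    disparities to try; returns the minimum-cost disparity map."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            disp[y, x] = min(candidates(y, x),
                             key=lambda d: cost(left, right, x, y, d))
    return disp

def hierarchical_disparity(left, right, D):
    # Quadruple search: full (D/4)-range search on quarter-size images.
    l4, r4 = left[::4, ::4], right[::4, ::4]
    d4 = search(l4, r4, lambda y, x: range(D // 4))
    # Double search: +/-1 around the doubled prediction, half-size images.
    l2, r2 = left[::2, ::2], right[::2, ::2]
    p2 = (np.kron(d4, np.ones((2, 2), dtype=int)) * 2)[:l2.shape[0], :l2.shape[1]]
    d2 = search(l2, r2, lambda y, x: range(max(0, p2[y, x] - 1), p2[y, x] + 2))
    # Single search: +/-1 around the doubled prediction, full-size images.
    p1 = (np.kron(d2, np.ones((2, 2), dtype=int)) * 2)[:left.shape[0], :left.shape[1]]
    return search(left, right, lambda y, x: range(max(0, p1[y, x] - 1), p1[y, x] + 2))
```

Note how each stage doubles the prediction from the previous, coarser level and then examines only a small band of candidates around it, which is the source of the complexity savings discussed in Section VI.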
From the block diagram it is obvious that up-scaling and down-scaling play a critical role in the whole hierarchical
process. These two image transformations are realized by interpolation algorithms. Image interpolation
can be described as the process of using known data to estimate values at unknown locations. The interpolated
value f(x) at coordinate x in a space of dimension q can be expressed as a linear combination of samples fk
evaluated at integer coordinates k = (k1, k2, . . . , kq) ∈ Z^q:

f(x) = Σ_{k∈Z^q} f_k φint(x − k), ∀x = (x1, x2, . . . , xq) ∈ R^q (4)
The sample weights are given by the values of the basis function ϕint(x − k) [36]. The basis interpolating
function is defined on a continuous-valued parameter and is evaluated at arbitrary points. The resulting values
comprise the final interpolated image.
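As a concrete instance of Eq. (4), choosing the triangle (linear B-spline) function for φint reduces the formula to ordinary linear interpolation in one dimension; the fuzzy technique of [7] used in this work is a more elaborate, data-dependent scheme, so this is only an illustration of the general form:

```python
def triangle_basis(t):
    """Linear B-spline basis: 1 - |t| inside [-1, 1], zero outside."""
    t = abs(t)
    return 1.0 - t if t < 1.0 else 0.0

def interpolate_1d(samples, x):
    """f(x) = sum_k f_k * phi(x - k) over the integer-indexed samples;
    with the triangle basis this is plain linear interpolation."""
    return sum(f * triangle_basis(x - k) for k, f in enumerate(samples))
```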
For the scaling process of our hierarchical matching technique we used a fuzzy scaling approach that offers
fine interpolation quality with real-time performance [7]. An interpolation technique with
low interpolation error is required for this operation, since the introduced interpolation error from the scaling
technique will be accumulated through the hierarchical process.
Fig. 7. Block diagram of the proposed hardware architecture: the left and right input images pass through the interpolation unit (with line buffers, producing the downscaled images), the absolute differences calculation unit (fed also by the upscaled quadruple/double search disparity map), the matching cost aggregation unit (line buffers, convolution with Gaussian mask generation, and cellular automata refinement) and the minimum-selection-based disparity selection unit, which outputs the final disparity map.
The selected interpolation algorithm utilizes a dynamic mask combined with a sophisticated neighborhood
averaging fuzzy algorithm. More precisely, the functions that contribute to the final interpolated image are the
areas of the input pixels, overlapped by a dynamic mask, and the difference in intensity between the input
pixels. Simple fuzzy if-then rules are involved in the inference process of these two components in order to
carry out the interpolation task.
Summarizing, in the hierarchical matching, extensive search provides better disparity accuracy at the expense
of increased complexity. The performance gain tends to diminish as the search steps increase. Double sample
search demonstrates a significant gain over quadruple; single sample search exhibits a moderate further
improvement, and so on. Furthermore, the above technique avoids being trapped in local minima, which is often
the case with other fast disparity algorithms. This is achieved by employing multiple predictors and dual pattern
refinement; further improvement in disparity estimation speed can be achieved by using an early termination
criterion.
V. HARDWARE ARCHITECTURE
The system comprises four basic functional units: 1) the interpolation unit; 2) the absolute differences
calculation unit; 3) the matching cost aggregation unit; and 4) the disparity selection unit, as shown in Fig. 7.
The input data of the system are the left and right non-rectified images, and the output of the system is the
final disparity map.
The proposed architecture was designed to perform the disparity map calculation on non-rectified images
through a sequence of pipeline stages. Parallel processing was used wherever possible in order to further
accelerate the process. The structure is focused on the proposed matching cost aggregation unit and the proposed
Fig. 8. Absolute differences calculation module: line buffers feed the left input image and, through a multiplexer controlled by the intermediate disparity map, the right input image to a subtraction stage; absolute value calculation follows, and the results are stored in an i × j × D register before being passed to the matching cost aggregation unit.
hierarchical matching technique.
A. Interpolation
The hardware implementation for the interpolation unit is a special case of the selected algorithm [7],
since we need only the ×2 upscaling and downscaling. The same circuit is used for both upscaling and
downscaling. The internal memory contains the fuzzy set base, consisting of the membership function values
associated with the fuzzy sets, the conditional rules to be processed [37], look-up tables and constants needed
for various calculations. The values of the Gaussian membership functions are determined through an iterative
approximation method implemented in hardware, thus no extra memory is required for a construction of an
additional look-up table. The inference is performed using the max-min method and the center of gravity
defuzzification method has been utilized to obtain the output.
Since the interpolation algorithm uses a maximum of four neighboring pixels of the non-rectified images, only
two line buffers are required for either vertical or horizontal mask filtering.
B. Absolute Differences Calculation
The matching cost, shown in Fig. 8, is computed between pixels of the two input images. Each
pixel of the left input image is loaded and forwarded to the subtraction module. Moreover, the position of
the left image's loaded pixel, in conjunction with the previously calculated intermediate disparity map, if any,
controls the flow of the right image's pixels towards the subtraction module, through a multiplexer. The result of
each subtraction is fed to an absolute value computing circuit, and stored in a register of proper length.
C. Matching Cost Aggregation
The stored results of the absolute differences module are forwarded as input to the matching cost aggregation
module, shown in Fig. 9. A window selection sub-module decides on the size of the utilized aggregation
window, loading at the same time the proper length of data from the input. The window size is used to fetch the
appropriate Gaussian weighting factors from a Look-Up Table (LUT). The loaded data and the fetched Gaussian
weights are fed to a convolution calculating circuit. This circuit can be implemented using only registers, adders
and multipliers. The convolution results are accumulated in a semi-parallel way to a proper register. Finally, the
stored convolution results are refined by the CA recalibration module. Pixels are assigned new values according
to their 3× 3 neighborhoods and the aforementioned two CA rules using only adding and subtracting circuitry.
The CA implementation is minimized and fully parallel.
Fig. 9. Matching cost aggregation module: a window selection sub-module loads buffered data from the absolute differences calculation unit and Gaussian weights from a LUT into the convolution module; the results pass through an i × j accumulation module and a 3 × 3 CA recalibration module on their way to the disparity selection unit.
D. Disparity Selection
The final disparity selection module is based on a minimum finding circuit. The results of the matching cost
aggregation module are fed semi-parallelly to the minimum finder, and the index, i.e. the position, of the minimum
value is stored in a buffer. Thus, the final disparity map, comprising an index (disparity) for each of the
initial images' pixels, is accumulated.
VI. EXPERIMENTAL RESULTS
The presented algorithm has been applied to various image pairs. Each experimentally tested pair’s images
suffered from one or more defects. None of the used images was rectified and some of the image pairs were
both horizontally and vertically displaced. Moreover, the lighting conditions for the self-captured images were
intentionally bad. Image pairs exhibiting such characteristics are a great challenge for stereo correspondence
algorithms. Furthermore, synthetic images alone are not enough to assess an algorithm's ability to operate in
real outdoor environments. Consequently, both synthetic and self-captured outdoor images have been used in
this section.
A. Synthetic Image Sets
The most common image set for evaluating stereo correspondence algorithms is the Tsukuba data set. While
typical stereo algorithms can only process horizontally displaced pictures of this set, the proposed method can
deal with both horizontal and vertical displacements at the same time. The Tsukuba data set consists of multiple
images of the scene captured by a camera grid with multiple, equally spaced horizontal and vertical steps. The
selected images from the dataset have 32 pixels maximum horizontal displacement, i.e. disparity, and 16 pixels
maximum vertical displacement. Moreover, we have manually applied to the input images the distortion a
lens would have produced, thus un-rectifying the originally rectified images in order to evaluate the
proposed algorithm in this context. The distorted input images as well as the intermediate and the final
results of the procedure are shown in Fig. 10.
The final result, Fig. 10(e), has the same dimensions as the input images, while the previous ones have their
half and quarter dimensions respectively. A full search algorithm would require D × D calculations for every
pixel. On the other hand, the proposed algorithm performs only (D/4) × (D/4) + 3 × 3 + 3 × 3 calculations.
Considering D = 32, the proposed algorithm is found to be 15.7 times less computationally demanding.
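The operation count follows from the coarse-to-fine structure: an exhaustive (D/4) × (D/4) search at quarter resolution, followed by a 3 × 3 refinement at each of the two finer levels. The sketch below illustrates this scheme for a single pixel; the block size, the use of simple striding for downsampling, and the SAD matching cost are our assumptions, not the paper's exact fuzzy cost.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(a.astype(float) - b.astype(float)).sum()

def best_offset(left, right, y, x, candidates, block=2):
    """Return the (dy, dx) candidate minimizing block SAD, skipping
    candidates that fall outside the right image."""
    h, w = right.shape
    best, best_cost = (0, 0), np.inf
    patch = left[y:y + block, x:x + block]
    for dy, dx in candidates:
        yy, xx = y + dy, x + dx
        if 0 <= yy and yy + block <= h and 0 <= xx and xx + block <= w:
            c = sad(patch, right[yy:yy + block, xx:xx + block])
            if c < best_cost:
                best, best_cost = (dy, dx), c
    return best

def hierarchical_match(left, right, y, x, D=32):
    """Coarse-to-fine 2-D correspondence for one pixel:
    (D/4)x(D/4) search at quarter scale, then two 3x3 refinements."""
    # quarter-resolution exhaustive search
    l4, r4 = left[::4, ::4], right[::4, ::4]
    coarse = [(dy, dx) for dy in range(D // 4) for dx in range(D // 4)]
    dy, dx = best_offset(l4, r4, y // 4, x // 4, coarse)
    dy, dx = dy * 2, dx * 2                    # lift to half resolution
    # half-resolution 3x3 refinement
    l2, r2 = left[::2, ::2], right[::2, ::2]
    ref = [(dy + i, dx + j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    dy, dx = best_offset(l2, r2, y // 2, x // 2, ref)
    dy, dx = dy * 2, dx * 2                    # lift to full resolution
    # full-resolution 3x3 refinement
    ref = [(dy + i, dx + j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return best_offset(left, right, y, x, ref)
```

Because the exhaustive search runs only at quarter resolution and the two finer levels check just nine candidates each, the count per pixel is (D/4)² + 9 + 9 instead of D².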
Additionally, the proposed algorithm has been applied to four commonly used image sets. Once again the
image sets were manually distorted with the use of special commercial software. Thus, the radial distortion of
(a) (b)
(c) (d) (e)
Fig. 10. (a), (b) The non-rectified, diagonally captured input images and the resulting disparity maps for (c) the quadruple, (d) double
and (e) single pixel estimation respectively.
an optical lens was simulated. The induced distortion was 10% for all four image pairs, as well as for their
given ground truth disparity maps. The tested distorted image pairs, as well as the calculated disparity maps,
are shown in Fig. 11.
The results shown in Fig. 11 were compared with the respective ground truth disparity maps, which had been distorted
to the same degree as the input images, i.e. 10%. For each distorted image set, the Normalized Mean Square
Error (NMSE) has been calculated as a quantitative measure of the algorithm’s behavior. Moreover, the proposed
algorithm has been applied to the original, undistorted versions of the image sets, and the NMSE has once
more been calculated. A typical stereo correspondence algorithm would have been able to cope with the undistorted
images, but it would have failed to process the distorted ones; the variation in performance would have been
significant, and always in favor of the undistorted image pairs. Table I gives the NMSE calculated for the proposed
algorithm when applied to the distorted and the original versions of the four image sets. The last column
presents the percentage of variation, where positive values indicate better results on the original image sets and
negative values indicate better results on the distorted image sets. It is evident that the proposed algorithm is
not affected by the presence of non-rectification effects in the processed images.
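As a reference for the figures reported here, a common definition of the NMSE between a computed disparity map and its ground truth is sketched below; the paper does not state its exact normalization, so normalizing by the energy of the ground truth is an assumption.

```python
import numpy as np

def nmse(computed, ground_truth):
    """Normalized Mean Square Error between two disparity maps,
    normalized by the energy of the ground truth (one common choice)."""
    c = computed.astype(float)
    g = ground_truth.astype(float)
    return np.sum((c - g) ** 2) / np.sum(g ** 2)
```

Identical maps yield an NMSE of zero, and the measure is dimensionless, which makes it comparable across differently scaled disparity ranges.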
The manually induced lens distortion percentage, i.e. 10%, was chosen as a typical mean value. However, the
performance of the proposed algorithm was also tested for various values of induced lens distortion. Seven versions
of the Tsukuba image set were prepared and tested. In Fig. 12 the two distorted input images, as well as
the calculated disparity maps, are shown for various percentages of distortion. The calculated NMSE for each
version is given in Table II, and these results can be visually assessed in Fig. 13. It can be deduced that the
proposed algorithm presents a stable behavior over a large range of distortion values.
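Such a distortion sweep can be approximated with a simple one-parameter radial model; the first-order polynomial form, the nearest-neighbor resampling and the helper name below are our assumptions, not the commercial tool actually used.

```python
import numpy as np

def radial_distort(img, k1):
    """Apply first-order radial distortion r' = r * (1 + k1 * r^2)
    about the image center, using nearest-neighbor sampling."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # normalized coordinates in [-1, 1]
    ny, nx = (ys - cy) / cy, (xs - cx) / cx
    r2 = nx**2 + ny**2
    # inverse mapping: sample the source at the radially scaled position
    sy = np.clip(np.rint(cy + ny * (1 + k1 * r2) * cy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(cx + nx * (1 + k1 * r2) * cx), 0, w - 1).astype(int)
    return img[sy, sx]
```

Setting k1 = 0 leaves the image unchanged, while increasing k1 produces progressively stronger barrel-like warping, analogous to the 0% to 15% levels tested above.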
(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
(j) (k) (l)
Fig. 11. From left to right: the left and right 10% distorted input images and the calculated final disparity map for (from top to bottom:)
the Tsukuba, Venus, Teddy and Cones image sets respectively.
B. Self-captured Image Sets
Furthermore, the algorithm has been applied to self-captured image pairs, Fig. 14-Fig. 16. The used pairs
suffer from typical outdoor environment issues. Apart from being shot by cameras displaced both horizontally
and vertically, with non-parallel viewing directions, they involve textureless areas and difficult lighting conditions.
Moreover, examination of Fig. 14(a), 14(b) and Fig. 16(a), 16(b) reveals that the different positions of the
cameras result in lighting and chromatic differences.
TABLE I
CALCULATED NMSE FOR THE PROPOSED ALGORITHM FOR VARIOUS PAIRS WITH CONSTANT DISTORTION 10%

Pair      NMSE (Distorted)   NMSE (Original)   Variation (%)
Tsukuba   0.0712             0.0781            -0.097
Venus     0.0491             0.0461            +0.061
Teddy     0.1098             0.0976            +0.111
Cones     0.0500             0.0519            -0.038
Fig. 12. (from left to right:) The left and right distorted input images and the calculated final disparity maps for induced lens distortions of
(a) 0%, (b) 2.5%, (c) 5%, (d) 7.5%, (e) 10%, (f) 12.5% and (g) 15%.
Fig. 13. The NMSE for the Tsukuba image pair for various distortion percentages.
(a) (b)
(c) (d) (e)
Fig. 14. (a), (b) The self-captured input images of an alley, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
(a) (b)
(c) (d) (e)
Fig. 15. (a), (b) The self-captured input images of a building, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
TABLE II
CALCULATED NMSE FOR THE PROPOSED ALGORITHM FOR THE TSUKUBA PAIR WITH VARIOUS DISTORTION PERCENTAGES

Distortion (%)   NMSE
0.0              0.0781
2.5              0.0712
5.0              0.0708
7.5              0.0663
10.0             0.0712
12.5             0.0723
15.0             0.0761
(a) (b)
(c) (d) (e)
Fig. 16. (a), (b) The self-captured input images of a corner, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
VII. CONCLUSIONS AND DISCUSSION
In this work a disparity estimation technique is proposed. It is able to process non-rectified input images
from uncalibrated stereo cameras while retaining low computational complexity. The hierarchical
search scheme is based on the JVT/H.264 motion estimation algorithm, initially developed for video coding.
The proposed algorithm searches for stereo correspondences inside D × D search blocks, requiring, however,
significantly fewer computations than a typical full search.
Sophisticated methods and techniques, such as Gaussian-weighted aggregation and 3-D CA refinement rules,
have been applied to a fuzzy-based hierarchical process. All the modules of the presented algorithm can
be parallelized and pipelined, resulting in an efficient hardware implementation. Consequently, the presented
hardware architecture of the algorithm makes the proposed method suitable for autonomous, real-time outdoor
robotics applications. The use of the aforementioned refining techniques adds complexity to the overall algorithm,
providing, however, results of higher quality. In this sense, the algorithm trades off only some of its structural
simplicity for better results.
The proposed algorithm’s performance remains practically unaffected by spatial displacements, lens distor-
tions or lighting asymmetries in the input images, as was qualitatively and quantitatively indicated. Moreover,
its ability to tolerate poorly calibrated or even uncalibrated input images, in conjunction with its speed and the
presented result quality, shows that the algorithm can cope with the demanding task of outdoor navigation.
ACKNOWLEDGMENT
This work was supported by the E.C. under the FP6 research project for vision and chemiresistor equipped
web-connected finding robots, ”View-Finder”, IST-2005-045541.
REFERENCES
[1] Konolige K, Agrawal M, Bolles RC, Cowan C, Fischler M, Gerkey BP. Outdoor Mapping and Navigation Using Stereo Vision. In:
Khatib O, Kumar V, Rus D, editors. ISER. vol. 39 of Springer Tracts in Advanced Robotics. Springer; 2006. p. 179–190.
[2] Soquet N, Aubert D, Hautiere N. Road Segmentation Supervised by an Extended V-Disparity Algorithm for Autonomous Navigation.
In: IEEE Intelligent Vehicles Symposium. Istanbul, Turkey; 2007. p. 160–165.
[3] Hogue A, German A, Jenkin M. Underwater environment reconstruction using stereo and inertial data. In: IEEE International
Conference on Systems, Man and Cybernetics. Montreal, Canada; 2007. p. 2372–2377.
[4] Klancar G, Kristan M, Karba R. Wide-angle camera distortions and non-uniform illumination in mobile robot tracking. Journal of
Robotics and Autonomous Systems. 2004;46:125–133.
[5] Middlebury Stereo Vision Page. http://vision.middlebury.edu/stereo/; 2009.
[6] Nalpantidis L, Sirakoulis GC, Gasteratos A. In: Darzentas J, Vouros GA, Vosinakis S, Arnellos A, editors. A dense stereo
correspondence algorithm for hardware implementation with enhanced disparity selection. vol. 5138 of Lecture Notes in Computer
Science. Berlin-Heidelberg: Springer-Verlag; 2008. p. 365–370.
[7] Amanatiadis A, Andreadis I, Konstantinidis K. Design and Implementation of a Fuzzy Area-Based Image-Scaling Technique. IEEE
Transactions on Instrumentation and Measurement. 2008;57(8):1504–1513.
[8] Nalpantidis L, Sirakoulis GC, Gasteratos A. Review of Stereo Vision Algorithms: from Software to Hardware. International Journal
of Optomechatronics. 2008;2(4):435–462.
[9] Yin P, Tourapis HYC, Tourapis A, Boyce J. Fast mode decision and motion estimation for JVT/H.264. In: Proc. Int. Conf. Image
Process.. vol. 3; 2003. p. 853–856.
[10] Wiegand T, Sullivan G, Bjøntegaard G, Luthra A. Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video
Technol. 2003;13(7):560–576.
[11] Marr D, Poggio T. Cooperative computation of stereo disparity. Science. 1976;194(4262):283.
[12] Murray D, Little JJ. Using real-time stereo vision for mobile robot navigation. Autonomous Robots. 2000;8(2):161–171.
[13] Murray D, Jennings C. Stereo vision based mapping and navigation for mobile robots. In: Proc. IEEE Int. Conf. on Robotics and
Automation. vol. 2; 1997. p. 1694–1699.
[14] Jain R, Kasturi R, Schunck BG. Machine vision. McGraw-Hill; 1995.
[15] Gasteratos A, Sandini G. In: Vlahavas IP, Spyropoulos CD, editors. Factors Affecting the Accuracy of an Active Vision Head. vol.
2308 of Lecture Notes in Computer Science. Berlin-Heidelberg: Springer-Verlag; 2002. p. 413–422.
[16] De Cubber G, Doroftei D, Nalpantidis L, Sirakoulis GC, Gasteratos A. Stereo-based Terrain Traversability Analysis for Robot
Navigation. In: IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance. Brussels, Belgium;
2008.
[17] Schreer O. Stereo Vision-Based Navigation in Unknown Indoor Environment. In: 5th European Conference on Computer Vision.
vol. 1; 1998. p. 203–217.
[18] Scharstein D, Szeliski R. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal
of Computer Vision. 2002;47(1-3):7–42.
[19] Torr PHS, Criminisi A. Dense stereo using pivoted dynamic programming. Image and Vision Computing. 2004;22(10):795–806.
[20] Labayrade R, Aubert D, Tarel JP. Real time obstacle detection in stereovision on non flat road geometry through ”v-disparity”
representation. In: IEEE Intelligent Vehicle Symposium. vol. 2. Versailles, France; 2002. p. 646–651.
[21] Kelly A, Stentz AT. Stereo Vision Enhancements for Low-Cost Outdoor Autonomous Vehicles. In: International Conference on
Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles, (ICRA ’98); 1998.
[22] Zhao J, Katupitiya J, Ward J. Global correlation based ground plane estimation using v-disparity image. In: IEEE International
Conference on Robotics and Automation. Rome, Italy; 2007. p. 529–534.
[23] Agrawal M, Konolige KG, Bolles RC. Localization and Mapping for Autonomous Navigation in Outdoor Terrains: A Stereo Vision
Approach. In: IEEE Workshop on Applications of Computer Vision; 2007. p. 7–7.
[24] Zach C, Karner K, Bischof H. Hierarchical disparity estimation with programmable 3D hardware. In: Proc. Int. Conf. in Central
Europe on Computer Graphics, Visualization and Computer Vision; 2004. p. 275–282.
[25] Jeong H, Park S. Generalized Trellis stereo matching with systolic array. Lecture Notes in Computer Science.
2004;3358:263–267.
[26] Park S, Jeong H. Real-time stereo vision FPGA chip with low error rate. In: Proc. Int. Conf. on Multimedia
and Ubiquitous Engineering; 2007. p. 751–756.
[27] Masrani D, MacLean W. A Real-time large disparity range stereo system using FPGAs. In: Proc. IEEE Int. Conf. on Computer
Vision Systems; 2006. p. 13–13.
[28] Von Neumann J. Theory of Self-Reproducing Automata. Urbana, Illinois: University of Illinois Press; 1966.
[29] Chopard B, Droz M. Cellular Automata Modeling of Physical systems. Cambridge: Cambridge University Press; 1998.
[30] Ulam S. Random processes and transformations. In: International Congress on Mathematics. vol. 2. Cambridge, USA; 1952. p.
264–275.
[31] Feynman R. Simulating Physics with Computers. International Journal of Theoretical Physics. 1982;21(6):467–488.
[32] Wolfram S. Theory and Applications of Cellular Automata. Singapore: World Scientific; 1986.
[33] Mardiris V, Sirakoulis GC, Mizas C, Karafyllidis I, Thanailakis A. A CAD System for Modeling and Simulation of Computer
Networks Using Cellular Automata. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 2008;38(2):253–264.
[34] Kotoulas L, Gasteratos A, Sirakoulis GC, Georgoulas C, Andreadis I. Enhancement of Fast Acquired Disparity Maps using a 1-D
Cellular Automaton Filter. In: IASTED International Conference on Visualization, Imaging and Image Processing. Benidorm, Spain;
2005. p. 355–359.
[35] Sirakoulis GC, Karafyllidis I, Thanailakis A. A CAD system for the construction and VLSI implementation of Cellular Automata
algorithms using VHDL. Microprocessors and Microsystems. 2003;27(8):381–396.
[36] Thevenaz P, Blu T, Unser M. Interpolation revisited. IEEE Trans Med Imag. 2000;19(7):739–758.
[37] Ascia G, Catania V, Russo M. VLSI hardware architecture for complex fuzzy systems. IEEE Trans Fuzzy Syst. 1999;7(5):553–570.