An Efficient Hierarchical Matching
Algorithm for Processing Uncalibrated Stereo
Vision Images and its Hardware Architecture

Lazaros Nalpantidis1, Angelos Amanatiadis2, Georgios Sirakoulis2 and Antonios Gasteratos1
1Laboratory of Robotics and Automation, Department of Production and Management Engineering,
Democritus University of Thrace, University Campus Kimmeria, GR-67100, Xanthi, Greece
Email: {lanalpa, agaster}@pme.duth.gr
2Laboratory of Electronics, Department of Electrical and Computer Engineering,
Democritus University of Thrace, 12 Vas. Sofias Str., GR-67100, Xanthi, Greece
Email: {aamanat, gsirak}@ee.duth.gr
Abstract
In motion estimation, the sub-pixel matching technique involves the search of sub-sample positions as well as
integer-sample positions between the image pairs, choosing the one that gives the best match. Based on this idea,
this work proposes an estimation algorithm, which performs a 2-D correspondence search using a hierarchical
search pattern. The intermediate results are refined by 3-D Cellular Automata (CA). The disparity value is then
defined using the distance of the matching position. Therefore, the proposed algorithm can process uncalibrated,
non-rectified stereo image pairs, maintaining the computational load within reasonable levels. Additionally, a
hardware architecture of the algorithm is deployed. Its performance has been evaluated on both synthetic and
real self-captured image sets. These attributes make the proposed method suitable for autonomous outdoor robotic
applications.
Index Terms
Disparity estimation, hierarchical matching, uncalibrated stereo vision, outdoor robotics.
I. INTRODUCTION
Outdoor robotics places strict constraints on the algorithms used [1], [2]. The lighting conditions in outdoor
environments are far from being ideal. A stereo camera, which acquires two displaced views of the same
scene, is very sensitive to such conditions [3], [4]. Moreover, the rough terrain and the bounces it causes to a
moving robot often decalibrate the cameras of a stereo acquisition array. Autonomous operation demands high
processing frame-rates. On the other hand, a robotic platform can provide limited computational resources,
power and payload capacity for many different onboard applications. These facts differentiate the priorities of
stereo vision algorithms intended for use in outdoor operating robots from those listed in [5]. The algorithms
listed in the aforementioned site compete based on their accuracy on four perfectly lighted, calibrated and
June 20, 2010 DRAFT
rectified image sets without any timing or computational constraints. However, this is not the case in outdoor
environments, and consequently a comparison of the proposed algorithm with those would be pointless.
This work presents a novel stereo vision algorithm, inspired by recent motion estimation techniques. The
proposed algorithm has been adapted to the demands of contemporary outdoor robotic applications. It is
based on a rapidly executed Sum of Absolute Differences (SAD) core for correspondence search in both directions
of the input images. The results of this core are enhanced using more sophisticated computational techniques:
Gaussian-weighted aggregation and 3-D CA rules [6]. The hierarchical iteration of the basic stereo algorithm
was achieved using a fuzzy scaling technique [7]. The aforementioned characteristics provide improved quality
of results while remaining easy to implement in hardware. As a result, the presented algorithm is
able to cope with uncalibrated input images. On the other hand, the hardware implementation of the algorithms
already proposed in the literature is not always straightforward [8]. Nevertheless, the need for hardware
implementations of efficient and robust stereo algorithms able to provide real-time frame rates, especially in the
case of moving robots, is pressing. The appeal of hardware implementations is that they easily outperform the
algorithms executed on a computer. The achieved frame-rates are generally higher. The power consumed by a
dedicated hardware platform, e.g. ASIC or FPGA, is considerably lower than that of a common microprocessor.
Moreover, the computational power of the robot's available onboard PCs is left intact. As a result, in this work
we also propose an efficient hardware implementation, based on the proposed parallel techniques, which can
serve the aforementioned needs and provide both qualitative results and real-time frame rates at the same time.
Additionally, the perception of objects' depth in a scene is crucial for autonomous robotic applications. The
accuracy of the depth estimation results and their rapid refresh rate, as a robot moves, are the cornerstone of
many higher-level behaviors. Since our proposed scheme is block-matching based and does not perform scanline
pixel matching, it requires neither camera calibration nor image rectification. However, it is clear that block
matching approaches require more computational resources since the number of pixels to be considered is
greatly increased. In order to address this problem, the proposed algorithm is a variation of a motion estimation
algorithm [9] which is used for JVT/H.264 video coding [10]. The adaptation of compression motion estimation
algorithms into disparity estimation schemes can be effective both in accuracy and complexity terms, since
compression algorithms also attempt to achieve complexity reduction while maintaining coding efficiency. On
the other hand, CA have been employed as an intelligent and efficient way to refine and enhance the stereo
algorithm’s intermediate results.
The rest of this paper is organized as follows. For readability reasons, a brief introduction to stereo vision
is presented in section II. The related work is reviewed in section III, while the proposed disparity estimation
algorithm is described in
full detail in section IV. The corresponding hardware architecture of the proposed algorithm is discussed in
section V, and the experimental results are shown in section VI. Finally, the conclusions are drawn in
section VII.
II. STEREO VISION
The estimation of the disparity between two images of the same scene is a long-standing issue for the
machine vision community [11]. Stereoscopic vision is based on the principle, first utilized by nature, that two
spatially differentiated views of the same scene provide enough information so as to perceive the depth of the
Fig. 1. Geometry of epipolar lines, where C1 and C2 are the left and right camera lens centers, respectively. (a) Point P1 in one image
plane may have arisen from any of the points in the line C1P1, and may appear in the alternate image plane at any point on the epipolar
line E2; (b) In the case of non-rectified images, point P1 may have arisen for any of the points inside the block B.
portrayed objects. Thus, the importance of stereo correspondence is apparent in the fields of machine vision,
virtual reality, robot navigation, simultaneous localization and mapping [12], [13], depth measurements and 3-D
environment reconstruction [14].
The two alternatives for estimating disparity are either to precisely align the stereo camera rig and then
perform the demanded rectification (leading to simple scanline searches), or to have an arbitrary stereo camera
setup and avoid any rectification (performing searches over 2-D blocks). Accurately aligned stereo devices
are very expensive, as they demand calibration of a series of factors in micrometer scale [15]. On the other hand,
non-ideal stereo configurations usually produce inferior results, as they fail to satisfy the epipolar constraint.
To reduce disparity estimation complexity, the used cameras are usually arranged in a parallel-axis configura-
tion or equivalently, the stereo image pairs are carefully rectified using the camera geometry. The main function
of a stereo correspondence algorithm is to match each (in the case of dense stereo) or some (in the case of
sparse stereo) pixels of the first image to their corresponding ones in the second image. The outcome of this
process is a depth image, i.e. a disparity map. This matching can be done as a 1-D search if the stereo pairs
are accurately rectified. In rectified pairs horizontal scan lines reside on the same epipolar lines, as shown in
Fig. 1(a).
A point P1 in one image plane may have arisen from any of the points on the line C1P1, and may appear in the
alternate image plane at any point on the so-called epipolar line E2. Thus, the search is theoretically reduced
to a scan line, since corresponding pair points reside on the same epipolar line. The difference between the two
horizontal coordinates of these points is the disparity. The disparity map results from assigning to each matched
pixel the disparity value of its correspondent. Afterwards, the depth of the scene can be derived from the
disparity map, since the closer an object is, the larger the disparity value it is expected to have.
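This inverse relation between distance and disparity follows from the standard parallel-axis triangulation formula Z = f·B/d, which the text does not state explicitly; the sketch below assumes it, and the focal length and baseline values in the usage example are purely illustrative:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Parallel-axis stereo triangulation: depth Z = f * B / d.
    Depth is inversely proportional to disparity, so closer
    objects produce larger disparity values."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

For example, with a (hypothetical) focal length of 800 px and a 0.12 m baseline, halving the disparity from 32 px to 16 px doubles the estimated depth from 3 m to 6 m.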
However, this approach makes disparity algorithms restrictive in application, and even a small parallel-axis
deviation introduces great disparity errors. On the other hand, the more general case of non-rectified image
pairs, shown in Fig. 1(b), leads to a sharp increase in the demanded calculations. In this case, all the points
inside a 2-D area have to be considered as possible matches, in contrast to the previously mentioned 1-D search.
III. RELATED WORK
Autonomous robots’ behavior greatly depends on the accuracy of their decision making algorithms [16]. In
the case of stereo vision-based navigation, the accuracy and the refresh rate of the computed disparity maps are
the cornerstone of its success [17]. Dense local stereo correspondence methods calculate depth for almost every
pixel of the scenery, taking into consideration only a small neighborhood of pixels each time [18]. On the other
hand, global methods are significantly more accurate but at the same time more computationally demanding,
as they account for the whole image [19]. However, since the most urgent constraint in autonomous robotics
is the real-time operation, such applications usually utilize local algorithms.
In the relevant literature, many stereo vision based methods have been proposed for outdoor navigation. Much
attention has been given to the issue by the automotive industry [20], [2] and enhancement methods have been
proposed [21]. Many advanced behaviors base their success upon the accuracy of the stereo vision based depth
calculation [22], [1], [23]. These works employ a fast SAD-based algorithm to provide the necessary depth
maps. The importance of reliable and noise-free disparity maps is evident. On the other hand, common issues
related to outdoor exploration, such as possible decalibration of the stereo system and tolerance to non-perfect
lighting conditions, are not addressed. Furthermore, to the best of our knowledge the majority of published
algorithms used for outdoor robot navigation have not been implemented on hardware. Unfortunately, there is
no commonly agreed test-bench for outdoor-focused real-time stereo algorithms, analogous to [5].
The issue of processing non-rectified images is common to applications where the sensory system is not
explicitly specified. The plethora of computations most commonly requires the massive parallelization found in
custom tailored hardware implementations. The contemporary powerful graphics machines are able to achieve
enhanced results in terms of processing time and data volume. A hierarchical disparity estimation algorithm
implemented on a programmable 3-D graphics processing unit is reported in [24]. This method can process either
rectified or non-rectified image pairs. Bidirectional matching is utilized in conjunction with a locally aggregated
sum of absolute intensity differences. This implementation, on an ATI Radeon 9700 Pro, can achieve up to 50 fps
for 256× 256 pixel input images. The FPGA implementation presented in [25] uses the dynamic programming
search method on a Trellis solution space. It copes with the vergent cameras case, i.e. cameras with optical
axes that intersect arbitrarily, producing non-rectified stereo pairs. The image pairs received from the cameras
are initially rectified using linear interpolation and then, during a second step, the disparity is calculated. The
architecture has the form of a linear systolic array using simple processing elements. The design is canonical
and simple to implement in parallel. The implementation requires 208 processing elements. The resulting
system can process 1280 × 1000 pixel images with up to 208 disparity levels at 15 fps. An extension of the
previous method is presented in [26]. The main difference is that data from previously processed lines are
incorporated so as to enforce better inter-scanline consistency. The running speed is 30 fps for 320 × 240
pixel images with 128 disparity levels. The number of utilized processing elements is 128. The percentage of
pixels with disparity error larger than 1 in the non-occluded areas is 2.63, 0.91, 3.44, and 1.88 for the Tsukuba,
Map, Venus and Sawtooth image sets, respectively. Finally, [27] proposes the utilization of the local weighted
phase-correlation method. The platform used is the Transmogrifier-4, containing four Altera Stratix S80 FPGAs.
The system performs rectification and left-right consistency check to improve the accuracy of the results. The
Fig. 2. Block diagram of the proposed stereo correspondence algorithm: the left and right images feed an absolute differences calculation block, followed by the matching cost aggregation stage (Gaussian weighted aggregation and CA refinement) and a disparity selection block that outputs the disparity map.
speed for 640 × 480 pixel images with 128 disparity levels is 30 fps.
A detailed taxonomy and presentation of dense stereo correspondence algorithms can be found in [18].
Additionally, the recent advances in the field as well as the aspect of hardware implementable stereo algorithms
are covered in [8].
IV. PROPOSED DISPARITY ESTIMATION ALGORITHM
A. Stereo Correspondence Algorithm
The proposed system utilizes a simple, rapidly executed stereo correspondence algorithm applied to each
stereo pair. Stereo disparity is computed using a typical three-stage local stereo correspondence algorithm
[6]. The main merit of the used algorithm is its low computational complexity. It is essentially a simple
SAD algorithm, with an aggregation step enhanced by CA rules. The algorithm is focused on achieving
high processing frame-rates. The structural elements of this algorithm are presented in Fig. 2. The stereo
correspondence algorithm deliberately involves a non-iterative disparity selection step. Moreover, the frequently
utilized additional final refinement step, i.e. filtering or interpolation, is absent for speed reasons.
The matching cost function utilized is the Absolute Differences (AD) which is inherently the simplest metric
of all, involving only summations and finding absolute values.
cost(x, y, d) = |Ileft(x, y) − Iright((x − d), y)| (1)
where Ileft and Iright denote the intensity values for the left and right image respectively, d is the value of the
disparity under examination, ranging from 0 to D − 1, and x, y are the coordinates of the pixel on the i, j plane.
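A minimal sketch of this cost computation over all candidate disparities, producing the 3-D cost volume described next, might look as follows (function and variable names are our own; grayscale NumPy arrays are assumed):

```python
import numpy as np

def ad_cost_volume(left, right, max_disp):
    """Compute cost(x, y, d) = |I_left(x, y) - I_right(x - d, y)|
    for every pixel and every candidate disparity d in [0, D-1].
    Positions where x - d falls outside the image get infinite cost."""
    h, w = left.shape
    costs = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # Shifted subtraction: left column x is compared to right column x - d.
        costs[:, d:, d] = np.abs(left[:, d:].astype(float)
                                 - right[:, :w - d].astype(float))
    return costs
```

The loop runs once per disparity level, so the whole volume is built with D vectorized array subtractions rather than per-pixel loops.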
Disparity Space Image (DSI) is a 3-D matrix, i.e. the i× j ×D volume, containing the computed matching
costs for every pixel and for all its potential disparity values, as shown in Fig. 3(a). The DSI values for
constant disparity value are aggregated inside fixed-size square windows. The dimensions (2w + 1)× (2w + 1)
of the chosen aggregation window play an important role in the quality of the final result. Generally, small
dimensions preserve details but suffer from noise. On the contrary, large dimensions may not preserve fine
details but significantly suppress the noise. However, the choice of the aggregation window’s dimensions has
to take into account the size of the processed input images. That is, the dimensions have to scale according to
the image size in order to obtain the same amount of detail and noise suppression.
The AD aggregation step of the proposed algorithm is a weighted summation. Each pixel is assigned a weight
depending on its Euclidean distance from the central pixel. A 2-D Gaussian function determines the weight
value for each pixel. The center of the function coincides with the central pixel. The standard deviation is
equal to the distance from the central pixel to the nearest window-border. The applied weighting mask can be
(a) DSI containing matching costs for
every combination of x, y, d.
(b) A 3×3 neighborhood
inside the DSI.
Fig. 3. Views of the Disparity Space Image (DSI).
calculated once and then be applied to all the aggregation windows without any further change. The weighted
summation can be also considered as a convolution between the cost function values and a weighting mask
matrix. Thus, the computational load of this procedure is kept within reasonable limits.
aggr(x, y, d) = Σ_{i=−w}^{w} Σ_{j=−w}^{w} mask(i, j) · cost((x + i), (y + j), d) (2)

where the pixel ranges [−w, w] define the weighted aggregation window.
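The mask construction and the weighted aggregation of the equation above can be sketched as follows (a direct, unoptimized double loop; the hardware described later uses a convolution circuit instead, and all names here are our own):

```python
import numpy as np

def gaussian_mask(w):
    """(2w+1) x (2w+1) Gaussian weighting mask centred on the
    central pixel; the standard deviation equals w, the distance
    from the centre to the nearest window border."""
    i = np.arange(-w, w + 1)
    xx, yy = np.meshgrid(i, i)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * w ** 2))

def aggregate_slice(cost, mask):
    """Weighted aggregation of one constant-disparity DSI slice:
    each output pixel is the mask-weighted sum of the costs in its
    (2w+1) x (2w+1) neighbourhood (edge-replicated borders)."""
    w = mask.shape[0] // 2
    padded = np.pad(cost, w, mode='edge')
    h, wd = cost.shape
    out = np.empty((h, wd))
    for y in range(h):
        for x in range(wd):
            out[y, x] = np.sum(mask * padded[y:y + 2 * w + 1,
                                             x:x + 2 * w + 1])
    return out
```

As the text notes, the mask is computed once and reused for every window; only the windowed sums change per pixel.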
Afterwards, CA rules are used in order to further refine the procedure's results. CA were first introduced by
von Neumann [28], who was thinking of imitating the behavior of a human brain in order to build a machine
able to solve very complex problems [28]. His ambitious project was to show that complex phenomena can,
in principle, be reduced to the dynamics of many identical, very simple primitives, capable of interacting and
maintaining their identity [29]. Following a suggestion by [30], von Neumann adopted a fully discrete approach,
in which space, time and even the dynamical variables were defined to be discrete. Consequently, CA comprise
a very effective computational tool in simulating physical systems and solving scientific problems, because they
can capture the essential features of systems where global behavior arises from the collective effect of simple
components which interact locally [31], [32]. In CA analysis, physical processes and systems are described by
a cell array and a local rule, which defines the new state of a cell depending on the states of its neighbors. All
cells can work in parallel, since each cell can independently update its own state. Therefore, the proposed
CA model/algorithm is massively parallel and is an ideal candidate to be implemented in hardware
[33], [34], [35].
Two CA transition rules are applied to each 3 × 3 pixel neighborhood inside the DSI, as shown in Fig. 3(b).
The values of the parameters used by them were determined, after extensive testing, so as to perform best. The first rule
attempts to resolve disparity ambiguities. It checks for excessive consistency of results along the disparity (d)
axis and, if necessary, corrects on the perpendicular (i, j) plane. The second rule is placed in order to smoothen
the results and at the same time to preserve the details. It checks and acts on constant-disparity planes.
1) if at least one of the two pixels lying on either side of a pixel along the disparity axis (d) differs
from the central pixel by less than half of its value, then its value is further aggregated within its 3×3 pixel,
constant-disparity neighborhood.
First CA rule Pseudocode
if {
  |aggr(x,y,d) - aggr(x,y,d-1)|
  < (1/2)aggr(x,y,d) }
or {
  |aggr(x,y,d) - aggr(x,y,d+1)|
  < (1/2)aggr(x,y,d) }
then {
  aggr(x,y,d) = (1/9) *
    sum(i = -1,0,1) sum(j = -1,0,1)
      aggr(x+i, y+j, d) }
2) if there are at least 7 pixels in the 3× 3 pixel neighborhood which differ from the central pixel by less than
half of the central pixel's value, then the central pixel's value is scaled down by a factor of 1.3, as dictated
by exhaustive testing.
Second CA rule Pseudocode
count = 0
for i,j = (-1,0,1) {
  if (i,j) <> (0,0) {
    if {
      |aggr(x+i,y+j,d) - aggr(x,y,d)|
      < (1/2)aggr(x,y,d) }
    then {
      count++ }}}
if { count >= 7 }
then {
  aggr(x,y,d) = (1/1.3)aggr(x,y,d) }
The two rules are applied once. Their outcome comprises the enhanced DSI, from which the optimum disparity
map is chosen by a simple, non-iterative Winner-Takes-All (WTA) final step.
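A direct, unoptimized reading of the two rules, applied once to the interior cells of the aggregated DSI, might look as follows; border handling and the exact read/write ordering are our assumptions, since the text leaves them unspecified:

```python
import numpy as np

def ca_refine(aggr):
    """Apply the two CA transition rules once to a DSI of shape
    (h, w, D); border cells are left unchanged in this sketch."""
    out = aggr.copy()
    h, w, D = aggr.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            for d in range(1, D - 1):
                c = aggr[y, x, d]
                # Rule 1: if a neighbour along the disparity axis is
                # closer than half the centre value, re-aggregate the
                # centre within its 3x3 constant-disparity window.
                if (abs(c - aggr[y, x, d - 1]) < 0.5 * c or
                        abs(c - aggr[y, x, d + 1]) < 0.5 * c):
                    out[y, x, d] = aggr[y - 1:y + 2, x - 1:x + 2, d].mean()
                # Rule 2: if at least 7 of the 8 on-plane neighbours
                # differ from the centre by less than half its value,
                # scale the (possibly updated) centre down by 1.3.
                neigh = aggr[y - 1:y + 2, x - 1:x + 2, d]
                close = np.abs(neigh - c) < 0.5 * c
                if close.sum() - close[1, 1] >= 7:
                    out[y, x, d] /= 1.3
    return out
```

Since every cell reads only the original `aggr` values, all cells could be updated in parallel, which is exactly the property the hardware implementation exploits.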
In the last stage the best disparity value for each pixel is decided by a WTA selection procedure. For each
image pixel with coordinates (i, j), the smallest value along the d axis is searched for, and its position is
declared to be the pixel's disparity value. That is:
disp(x, y) = arg min_d (aggr(x, y, d)) (3)
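In array form, this WTA step is a single argmin along the disparity axis of the refined DSI (a sketch with our own naming):

```python
import numpy as np

def wta_disparity(aggr):
    """Winner-Takes-All selection: for every pixel (x, y), pick the
    disparity index with the minimum aggregated cost along the d axis."""
    return np.argmin(aggr, axis=2)
```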
B. Hierarchical Matching Technique
The most computationally intensive process for a dense disparity algorithm is the accurate pixel matching for
each pair of images. In this section we will describe the adapted motion estimation scheme that we used in our
implementation for disparity estimation, which was first proposed in [9] for JVT/H.264 video coding [10].
Fig. 4. Quadruple, double and single pixel sample matching algorithm.
Fig. 5. General scheme of the proposed hierarchical matching disparity algorithm. The search block is enlarged for viewing purposes.
Let the maximum expected horizontal disparity for a stereo image pair be D. The dimensions of the stereo
pixel matching search block are D×D. For each search block, the disparity value is determined by the horizontal
distance of the (single pixel sampling) best match in terms of minimum sum of absolute differences, as shown
in Fig. 4. In the first stage, the disparity algorithm finds the best match on the quadruple sample grid (circles).
Then, the algorithm searches the double pixel positions next to this best match (squares) to assess whether the
match can be improved and if so, the single pixel positions next to the best double pixel position (triangles) are
then explored. The general scheme of the proposed hierarchical matching disparity algorithm between a stereo
image pair is shown in Fig. 5. Each of the intermediate disparity maps of the first two steps is used as the
initial condition for the succeeding, refining correspondence search.
In order to perform the hierarchical disparity search three different versions of the input images are employed
and the stereo correspondence algorithm is applied to each of these three pairs. The quadruple search step is
performed as a normal pixel-by-pixel search, on a quarter-size version of the input images. That is, each of the
initial images has been down-sampled to 25% of their initial dimensions. The quadruple search is performed
by applying the stereo correspondence algorithm in (D/4) × (D/4) search regions, on the down-sized image
pair (D being the maximum expected horizontal disparity value in the original image pair). The choice of the
maximum searched disparity D/4 is reasonable as the search is performed on a 1/4 version of the original
images.
The window value 2w + 1 used in this stage is 9, i.e. w = 4. Once the best match is obtained for each
pixel, another correspondence search is performed in 3 × 3 search regions, on a half-size version of the initial
Fig. 6. Block diagram of the hierarchical disparity search algorithm: the left and right images are downsized to 25% and 50%; the stereo correspondence algorithm computes the quadruple search disparity map, which is upscaled by a factor of 2 to initialize the double search; that result is in turn upscaled by a factor of 2 to initialize the final, full-resolution search producing the final disparity map.
image pair. Thus, the double pixel search is performed on a 50% down-sampled version of the input images
with window dimension 2w + 1 being 15, i.e. w = 7. Finally, the single pixel matching is performed in 3 × 3
regions, on the original input pair. The window value 2w + 1 used in this final stage is 23, i.e. w = 11. The
block diagram of the presented algorithm is shown in Fig. 6. The choice of 3×3 search regions for the last two
steps of the hierarchical pattern can be explained as follows. The first stage is expected to find the best match
for each pixel. As the next stage uses another version of the same image with double dimensions, the initially
matched pixel could have been mapped to any pixel of the corresponding 3 × 3 neighborhood in the bigger
version of the image.
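The three-level pattern can be sketched as follows, in a deliberately simplified form: horizontal disparity only, single-pixel costs instead of the full aggregation/CA pipeline, and naive decimation instead of the fuzzy scaling; all names are our own:

```python
import numpy as np

def cost(left, right, x, y, d):
    # Single-pixel absolute difference (windowed SAD omitted for brevity).
    if 0 <= x - d < right.shape[1]:
        return abs(float(left[y, x]) - float(right[y, x - d]))
    return np.inf

def search(left, right, candidates):
    """Per-pixel search over candidates(y, x), an iterable of
    disparities to try; returns the minimum-cost disparity map."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            disp[y, x] = min(candidates(y, x),
                             key=lambda d: cost(left, right, x, y, d))
    return disp

def hierarchical_disparity(left, right, D):
    # Quadruple search: full (D/4)-range search on quarter-size images.
    l4, r4 = left[::4, ::4], right[::4, ::4]
    d4 = search(l4, r4, lambda y, x: range(D // 4))
    # Double search: +/-1 around the doubled prediction, half-size images.
    l2, r2 = left[::2, ::2], right[::2, ::2]
    p2 = (np.kron(d4, np.ones((2, 2), dtype=int)) * 2)[:l2.shape[0], :l2.shape[1]]
    d2 = search(l2, r2, lambda y, x: range(max(0, p2[y, x] - 1), p2[y, x] + 2))
    # Single search: +/-1 around the doubled prediction, full-size images.
    p1 = (np.kron(d2, np.ones((2, 2), dtype=int)) * 2)[:left.shape[0], :left.shape[1]]
    return search(left, right, lambda y, x: range(max(0, p1[y, x] - 1), p1[y, x] + 2))
```

Note how each stage doubles the prediction from the previous, coarser level and then examines only a small band of candidates around it, which is the source of the complexity savings discussed in Section VI.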
From the block diagram it is obvious that up-scaling and down-scaling play a critical role in the whole hierarchical
process. These two image transformations are realized by interpolation algorithms. Image interpolation
can be described as the process of using known data to estimate values at unknown locations. The interpolated
value f(x) at coordinate x in a space of dimension q can be expressed as a linear combination of samples fk
evaluated at integer coordinates k = (k1, k2, . . . , kq) ∈ Z^q:

f(x) = Σ_{k∈Z^q} f_k φint(x − k), ∀x = (x1, x2, . . . , xq) ∈ R^q (4)
The sample weights are given by the values of the basis function ϕint(x − k) [36]. The basis interpolating
function is defined on a continuous-valued parameter and is evaluated at arbitrary points. The resulting values
comprise the final interpolated image.
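As a concrete instance of Eq. (4), choosing the triangle (linear B-spline) function for φint reduces the formula to ordinary linear interpolation in one dimension; the fuzzy technique of [7] used in this work is a more elaborate, data-dependent scheme, so this is only an illustration of the general form:

```python
def triangle_basis(t):
    """Linear B-spline basis: 1 - |t| inside [-1, 1], zero outside."""
    t = abs(t)
    return 1.0 - t if t < 1.0 else 0.0

def interpolate_1d(samples, x):
    """f(x) = sum_k f_k * phi(x - k) over the integer-indexed samples;
    with the triangle basis this is plain linear interpolation."""
    return sum(f * triangle_basis(x - k) for k, f in enumerate(samples))
```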
For the scaling process of our hierarchical matching technique we used a fuzzy scaling approach that offers
fine interpolation quality with real-time performance [7]. An interpolation technique with
low interpolation error is required for this operation, since the introduced interpolation error from the scaling
technique will be accumulated through the hierarchical process.
Fig. 7. Block diagram of the proposed hardware architecture: the left and right input images pass through the interpolation unit (with line buffers, producing the downscaled images), the absolute differences calculation unit (fed also by the upscaled quadruple/double search disparity map), the matching cost aggregation unit (line buffers, convolution with Gaussian mask generation, and cellular automata refinement) and the minimum-selection-based disparity selection unit, which outputs the final disparity map.
The selected interpolation algorithm utilizes a dynamic mask combined with a sophisticated neighborhood
averaging fuzzy algorithm. More precisely, the functions that contribute to the final interpolated image are the
areas of the input pixels, overlapped by a dynamic mask, and the difference in intensity between the input
pixels. Simple fuzzy if-then rules are involved in the inference process of these two components in order to
carry out the interpolation task.
Summarizing, in the hierarchical matching, extensive search provides better disparity accuracy at the expense
of increased complexity. The performance gain tends to diminish as the search steps increase. Double sample
search demonstrates a significant gain over quadruple; single sample search exhibits a moderate further
improvement, and so on. Furthermore, the above technique avoids being trapped in local minima, which is often
the case with other fast disparity algorithms. This is achieved by employing multiple predictors and dual pattern
refinement; further improvement in disparity estimation speed can be achieved by using an early termination
criterion.
V. HARDWARE ARCHITECTURE
The system comprises four basic functional units: 1) the interpolation unit; 2) the absolute differences
calculation unit; 3) the matching cost aggregation unit; and 4) the disparity selection unit, as shown in Fig. 7.
The input data of the system are the left and right non-rectified images, and the output of the system is the
final disparity map.
The proposed architecture was designed to perform the disparity map calculation on non-rectified images
through a sequence of pipeline stages. Parallel processing was used wherever possible in order to further
accelerate the process. The structure is focused on the proposed matching cost aggregation unit and the proposed
Fig. 8. Absolute differences calculation module: line buffers feed the left input image and, through a multiplexer controlled by the intermediate disparity map, the right input image to a subtraction stage; absolute value calculation follows, and the results are stored in an i × j × D register before being passed to the matching cost aggregation unit.
hierarchical matching technique.
A. Interpolation
The hardware implementation for the interpolation unit is a special case of the selected algorithm [7],
since we need only the ×2 upscaling and downscaling. The same circuit is used for both upscaling and
downscaling. The internal memory contains the fuzzy set base, consisting of the membership function values
associated with the fuzzy sets, the conditional rules to be processed [37], look-up tables and constants needed
for various calculations. The values of the Gaussian membership functions are determined through an iterative
approximation method implemented in hardware, thus no extra memory is required for a construction of an
additional look-up table. The inference is performed using the max-min method and the center of gravity
defuzzification method has been utilized to obtain the output.
Since the interpolation algorithm uses a maximum of four neighboring pixels of the non-rectified images, only
two line buffers are required for either vertical or horizontal mask filtering.
B. Absolute Differences Calculation
The matching cost, shown in Fig. 8, is computed between pixels of the two input images. Each
pixel of the left input image is loaded and forwarded to the subtraction module. Moreover, the position of
the left image's loaded pixel, in conjunction with the previously calculated intermediate disparity map, if any,
controls the flow of the right image's pixels towards the subtraction module, through a multiplexer. The result of
each subtraction is fed to an absolute value computing circuit, and stored in a register of proper length.
C. Matching Cost Aggregation
The stored results of the absolute differences module are forwarded as input to the matching cost aggregation
module, shown in Fig. 9. A window selection sub-module decides on the size of the utilized aggregation
window, loading at the same time the proper length of data from the input. The window size is used to fetch the
appropriate Gaussian weighting factors from a Look-Up Table (LUT). The loaded data and the fetched Gaussian
weights are fed to a convolution calculating circuit. This circuit can be implemented using only registers, adders
and multipliers. The convolution results are accumulated in a semi-parallel way to a proper register. Finally, the
stored convolution results are refined by the CA recalibration module. Pixels are assigned new values according
to their 3× 3 neighborhoods and the aforementioned two CA rules using only adding and subtracting circuitry.
The CA implementation is minimized and fully parallel.
Fig. 9. Matching cost aggregation module: a window selection sub-module loads buffered data from the absolute differences calculation unit and Gaussian weights from a LUT into the convolution module; the results pass through an i × j accumulation module and a 3 × 3 CA recalibration module on their way to the disparity selection unit.
D. Disparity Selection
The final disparity selection module is based on a minimum finding circuit. The results of the matching cost
aggregation module are fed semi-parallelly to the minimum finder, and the index, i.e. the position, of the minimum
value is stored in a buffer. Thus, the final disparity map, comprising an index (disparity) for each of the
initial images' pixels, is accumulated.
VI. EXPERIMENTAL RESULTS
The presented algorithm has been applied to various image pairs. Each experimentally tested pair’s images
suffered from one or more defects. None of the used images was rectified and some of the image pairs were
both horizontally and vertically displaced. Moreover, the lighting conditions for the self-captured images were
intentionally bad. Image pairs exhibiting such characteristics are a great challenge for stereo correspondence
algorithms. Furthermore, synthetic images alone are not enough to assess an algorithm's ability to operate in
real outdoor environments. Consequently, both synthetic and self-captured outdoor images have been used in
this section.
A. Synthetic Image Sets
The most common image set for evaluating stereo correspondence algorithms is the Tsukuba data set. While
typical stereo algorithms can only process horizontally displaced pictures of this set, the proposed method can
deal with both horizontal and vertical displacements at the same time. The Tsukuba data set consists of multiple
images of the scene captured by a camera grid with multiple, equally spaced horizontal and vertical steps. The
selected images from the dataset have 32 pixels maximum horizontal displacement, i.e. disparity, and 16 pixels
maximum vertical displacement. Moreover, we have manually applied to the input images the distortion a
lens would have produced, thus un-rectifying the originally rectified images in order to evaluate the
proposed algorithm in this context. The distorted input images as well as the intermediate and the final
results of the procedure are shown in Fig. 10.
The final result, Fig. 10(e), has the same dimensions as the input images, while the previous ones have their
half and quarter dimensions respectively. A full search algorithm would require D × D calculations for every
pixel. On the other hand, the proposed algorithm performs only (D/4) × (D/4) + 3 × 3 + 3 × 3 calculations.
Considering D = 32, the proposed algorithm is found to be 15.7 times less computationally demanding.
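The operation count follows from the coarse-to-fine structure: an exhaustive (D/4) × (D/4) search at quarter resolution, followed by a 3 × 3 refinement at each of the two finer levels. The sketch below illustrates this scheme for a single pixel; the block size, the use of simple striding for downsampling, and the SAD matching cost are our assumptions, not the paper's exact fuzzy cost.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(a.astype(float) - b.astype(float)).sum()

def best_offset(left, right, y, x, candidates, block=2):
    """Return the (dy, dx) candidate minimizing block SAD, skipping
    candidates that fall outside the right image."""
    h, w = right.shape
    best, best_cost = (0, 0), np.inf
    patch = left[y:y + block, x:x + block]
    for dy, dx in candidates:
        yy, xx = y + dy, x + dx
        if 0 <= yy and yy + block <= h and 0 <= xx and xx + block <= w:
            c = sad(patch, right[yy:yy + block, xx:xx + block])
            if c < best_cost:
                best, best_cost = (dy, dx), c
    return best

def hierarchical_match(left, right, y, x, D=32):
    """Coarse-to-fine 2-D correspondence for one pixel:
    (D/4)x(D/4) search at quarter scale, then two 3x3 refinements."""
    # quarter-resolution exhaustive search
    l4, r4 = left[::4, ::4], right[::4, ::4]
    coarse = [(dy, dx) for dy in range(D // 4) for dx in range(D // 4)]
    dy, dx = best_offset(l4, r4, y // 4, x // 4, coarse)
    dy, dx = dy * 2, dx * 2                    # lift to half resolution
    # half-resolution 3x3 refinement
    l2, r2 = left[::2, ::2], right[::2, ::2]
    ref = [(dy + i, dx + j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    dy, dx = best_offset(l2, r2, y // 2, x // 2, ref)
    dy, dx = dy * 2, dx * 2                    # lift to full resolution
    # full-resolution 3x3 refinement
    ref = [(dy + i, dx + j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return best_offset(left, right, y, x, ref)
```

Because the exhaustive search runs only at quarter resolution and the two finer levels check just nine candidates each, the count per pixel is (D/4)² + 9 + 9 instead of D².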
Additionally, the proposed algorithm has been applied to four commonly used image sets. Once again the
image sets were manually distorted with the use of special commercial software. Thus, the radial distortion of
(a) (b)
(c) (d) (e)
Fig. 10. (a), (b) The non-rectified, diagonally captured input images and the resulting disparity maps for (c) the quadruple, (d) double
and (e) single pixel estimation respectively.
an optical lens was simulated. The induced distortion was 10% for all four image pairs, as well as for their
given ground truth disparity maps. The tested distorted image pairs, as well as the calculated disparity maps,
are shown in Fig. 11.
The results shown in Fig. 11 were compared with the respective ground truth disparity maps, which had been distorted
to the same degree as the input images, i.e. 10%. For each distorted image set, the Normalized Mean Square
Error (NMSE) has been calculated as a quantitative measure of the algorithm’s behavior. Moreover, the proposed
algorithm has been applied to the original, undistorted versions of the image sets, and the NMSE has once
more been calculated. A typical stereo correspondence algorithm would have been able to cope with the undistorted
images, but it would have failed to process the distorted ones; the variation in performance would have been
significant, and always in favor of the undistorted image pairs. Table I gives the NMSE calculated for the proposed
algorithm when applied to the distorted and the original versions of the four image sets. The last column
presents the percentage of variation, where positive values indicate better results on the original image sets and
negative values indicate better results on the distorted image sets. It is evident that the proposed algorithm is
not affected by the presence of non-rectification effects in the processed images.
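As a reference for the figures reported here, a common definition of the NMSE between a computed disparity map and its ground truth is sketched below; the paper does not state its exact normalization, so normalizing by the energy of the ground truth is an assumption.

```python
import numpy as np

def nmse(computed, ground_truth):
    """Normalized Mean Square Error between two disparity maps,
    normalized by the energy of the ground truth (one common choice)."""
    c = computed.astype(float)
    g = ground_truth.astype(float)
    return np.sum((c - g) ** 2) / np.sum(g ** 2)
```

Identical maps yield an NMSE of zero, and the measure is dimensionless, which makes it comparable across differently scaled disparity ranges.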
The manually induced lens distortion percentage, i.e. 10%, was chosen as a typical mean value. However, the
performance of the proposed algorithm was also tested for various values of induced lens distortion. Seven versions
of the Tsukuba image set were prepared and tested. In Fig. 12 the two distorted input images, as well as
the calculated disparity maps, are shown for various percentages of distortion. The calculated NMSE for each
version is given in Table II, and these results can be visually assessed in Fig. 13. It can be deduced that the
proposed algorithm presents a stable behavior over a large range of distortion values.
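Such a distortion sweep can be approximated with a simple one-parameter radial model; the first-order polynomial form, the nearest-neighbor resampling and the helper name below are our assumptions, not the commercial tool actually used.

```python
import numpy as np

def radial_distort(img, k1):
    """Apply first-order radial distortion r' = r * (1 + k1 * r^2)
    about the image center, using nearest-neighbor sampling."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # normalized coordinates in [-1, 1]
    ny, nx = (ys - cy) / cy, (xs - cx) / cx
    r2 = nx**2 + ny**2
    # inverse mapping: sample the source at the radially scaled position
    sy = np.clip(np.rint(cy + ny * (1 + k1 * r2) * cy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(cx + nx * (1 + k1 * r2) * cx), 0, w - 1).astype(int)
    return img[sy, sx]
```

Setting k1 = 0 leaves the image unchanged, while increasing k1 produces progressively stronger barrel-like warping, analogous to the 0% to 15% levels tested above.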
(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
(j) (k) (l)
Fig. 11. From left to right: the left and right 10% distorted input images and the calculated final disparity map for (from top to bottom:)
the Tsukuba, Venus, Teddy and Cones image sets respectively.
B. Self-captured Image Sets
Furthermore, the algorithm has been applied to self-captured image pairs, Fig. 14-Fig. 16. The used pairs
suffer from typical outdoor environment issues. Apart from being shot by cameras displaced both horizontally
and vertically, with non-parallel viewing directions, they involve textureless areas and difficult lighting conditions.
Moreover, examination of Fig. 14(a), 14(b) and Fig. 16(a), 16(b) reveals that the different positions of the
cameras result in lighting and chromatic differences.
TABLE I
CALCULATED NMSE FOR THE PROPOSED ALGORITHM FOR VARIOUS PAIRS WITH CONSTANT DISTORTION 10%

Pair      NMSE (Distorted)   NMSE (Original)   Variation (%)
Tsukuba   0.0712             0.0781            -0.097
Venus     0.0491             0.0461            +0.061
Teddy     0.1098             0.0976            +0.111
Cones     0.0500             0.0519            -0.038
Fig. 12. (from left to right:) The left and right distorted input images and the calculated final disparity maps for induced lens distortions of
(a) 0%, (b) 2.5%, (c) 5%, (d) 7.5%, (e) 10%, (f) 12.5% and (g) 15%.
Fig. 13. The NMSE for the Tsukuba image pair for various distortion percentages.
(a) (b)
(c) (d) (e)
Fig. 14. (a), (b) The self-captured input images of an alley, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
(a) (b)
(c) (d) (e)
Fig. 15. (a), (b) The self-captured input images of a building, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
TABLE II
CALCULATED NMSE FOR THE PROPOSED ALGORITHM FOR THE TSUKUBA PAIR WITH VARIOUS DISTORTION PERCENTAGES

Distortion (%)   NMSE
0.0              0.0781
2.5              0.0712
5.0              0.0708
7.5              0.0663
10.0             0.0712
12.5             0.0723
15.0             0.0761
(a) (b)
(c) (d) (e)
Fig. 16. (a), (b) The self-captured input images of a corner, and the resulting disparity maps for (c) the quadruple, (d) double and (e)
single pixel estimation respectively.
VII. CONCLUSIONS AND DISCUSSION
In this work a disparity estimation technique is proposed. It is able to process non-rectified input images
from uncalibrated stereo cameras while retaining low computational complexity. The hierarchical
search scheme is based on the JVT/H.264 motion estimation algorithm, initially developed for video coding.
The proposed algorithm searches for stereo correspondences inside D × D search blocks, requiring, however,
significantly fewer computations than a typical full search.
Sophisticated methods and techniques, such as Gaussian-weighted aggregation and 3-D CA refinement rules,
have been applied to a fuzzy-based hierarchical process. All the modules of the presented algorithm can
be parallelized and pipelined, resulting in an efficient hardware implementation. Consequently, the presented
hardware architecture of the algorithm makes the proposed method suitable for autonomous, real-time outdoor
robotics applications. The use of the aforementioned refining techniques adds complexity to the overall algorithm,
providing, however, results of higher quality. In this sense, the algorithm trades off only some of its structural
simplicity for better results.
The proposed algorithm’s performance remains practically unaffected by spatial displacements, lens distor-
tions or lighting asymmetries in the input images, as was qualitatively and quantitatively indicated. Moreover,
its ability to tolerate poorly calibrated or even uncalibrated input images, in conjunction with its speed and the
presented result quality, shows that the algorithm can cope with the demanding task of outdoor navigation.
ACKNOWLEDGMENT
This work was supported by the E.C. under the FP6 research project for vision and chemiresistor equipped
web-connected finding robots, ”View-Finder”, IST-2005-045541.
REFERENCES
[1] Konolige K, Agrawal M, Bolles RC, Cowan C, Fischler M, Gerkey BP. Outdoor Mapping and Navigation Using Stereo Vision. In:
Khatib O, Kumar V, Rus D, editors. ISER. vol. 39 of Springer Tracts in Advanced Robotics. Springer; 2006. p. 179–190.
[2] Soquet N, Aubert D, Hautiere N. Road Segmentation Supervised by an Extended V-Disparity Algorithm for Autonomous Navigation.
In: IEEE Intelligent Vehicles Symposium. Istanbul, Turkey; 2007. p. 160–165.
[3] Hogue A, German A, Jenkin M. Underwater environment reconstruction using stereo and inertial data. In: IEEE International
Conference on Systems, Man and Cybernetics. Montreal, Canada; 2007. p. 2372–2377.
[4] Klancar G, Kristan M, Karba R. Wide-angle camera distortions and non-uniform illumination in mobile robot tracking. Journal of
Robotics and Autonomous Systems. 2004;46:125–133.
[5] Middlebury Stereo Vision Page. http://vision.middlebury.edu/stereo/; 2009.
[6] Nalpantidis L, Sirakoulis GC, Gasteratos A. In: Darzentas J, Vouros GA, Vosinakis S, Arnellos A, editors. A dense stereo
correspondence algorithm for hardware implementation with enhanced disparity selection. vol. 5138 of Lecture Notes in Computer
Science. Berlin-Heidelberg: Springer-Verlag; 2008. p. 365–370.
[7] Amanatiadis A, Andreadis I, Konstantinidis K. Design and Implementation of a Fuzzy Area-Based Image-Scaling Technique. IEEE
Transactions on Instrumentation and Measurement. 2008;57(8):1504–1513.
[8] Nalpantidis L, Sirakoulis GC, Gasteratos A. Review of Stereo Vision Algorithms: from Software to Hardware. International Journal
of Optomechatronics. 2008;2(4):435–462.
[9] Yin P, Tourapis HYC, Tourapis A, Boyce J. Fast mode decision and motion estimation for JVT/H.264. In: Proc. Int. Conf. Image
Process.. vol. 3; 2003. p. 853–856.
[10] Wiegand T, Sullivan G, Bjøntegaard G, Luthra A. Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video
Technol. 2003;13(7):560–576.
[11] Marr D, Poggio T. Cooperative computation of stereo disparity. Science. 1976;194(4262):283.
[12] Murray D, Little JJ. Using real-time stereo vision for mobile robot navigation. Autonomous Robots. 2000;8(2):161–171.
[13] Murray D, Jennings C. Stereo vision based mapping and navigation for mobile robots. In: Proc. IEEE Int. Conf. on Robotics and
Automation. vol. 2; 1997. p. 1694–1699.
[14] Jain R, Kasturi R, Schunck BG. Machine vision. McGraw-Hill; 1995.
[15] Gasteratos A, Sandini G. In: Vlahavas IP, Spyropoulos CD, editors. Factors Affecting the Accuracy of an Active Vision Head. vol.
2308 of Lecture Notes in Computer Science. Berlin-Heidelberg: Springer-Verlag; 2002. p. 413–422.
[16] De Cubber G, Doroftei D, Nalpantidis L, Sirakoulis GC, Gasteratos A. Stereo-based Terrain Traversability Analysis for Robot
Navigation. In: IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance. Brussels, Belgium;
2008.
[17] Schreer O. Stereo Vision-Based Navigation in Unknown Indoor Environment. In: 5th European Conference on Computer Vision.
vol. 1; 1998. p. 203–217.
[18] Scharstein D, Szeliski R. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal
of Computer Vision. 2002;47(1-3):7–42.
[19] Torr PHS, Criminisi A. Dense stereo using pivoted dynamic programming. Image and Vision Computing. 2004;22(10):795–806.
[20] Labayrade R, Aubert D, Tarel JP. Real time obstacle detection in stereovision on non flat road geometry through ”v-disparity”
representation. In: IEEE Intelligent Vehicle Symposium. vol. 2. Versailles, France; 2002. p. 646–651.
[21] Kelly A, Stentz AT. Stereo Vision Enhancements for Low-Cost Outdoor Autonomous Vehicles. In: International Conference on
Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles, (ICRA ’98); 1998.
[22] Zhao J, Katupitiya J, Ward J. Global correlation based ground plane estimation using v-disparity image. In: IEEE International
Conference on Robotics and Automation. Rome, Italy; 2007. p. 529–534.
[23] Agrawal M, Konolige KG, Bolles RC. Localization and Mapping for Autonomous Navigation in Outdoor Terrains: A Stereo Vision
Approach. In: IEEE Workshop on Applications of Computer Vision; 2007. p. 7–7.
[24] Zach C, Karner K, Bischof H. Hierarchical disparity estimation with programmable 3D hardware. In: Proc. Int. Conf. in Central
Europe on Computer Graphics, Visualization and Computer Vision; 2004. p. 275–282.
[25] Jeong H, Park S. Generalized Trellis stereo matching with systolic array. Lecture Notes in Computer Science.
2004;3358:263–267.
[26] Park S, Jeong H. Real-time stereo vision FPGA chip with low error rate. In: Proc. Int. Conf. on Multimedia
and Ubiquitous Engineering; 2007. p. 751–756.
[27] Masrani D, MacLean W. A Real-time large disparity range stereo system using FPGAs. In: Proc. IEEE Int. Conf. on Computer
Vision Systems; 2006. p. 13–13.
[28] Von Neumann J. Theory of Self-Reproducing Automata. Urbana, Illinois: University of Illinois Press; 1966.
[29] Chopard B, Droz M. Cellular Automata Modeling of Physical systems. Cambridge: Cambridge University Press; 1998.
[30] Ulam S. Random processes and transformations. In: International Congress on Mathematics. vol. 2. Cambridge, USA; 1952. p.
264–275.
[31] Feynman R. Simulating Physics with Computers. International Journal of Theoretical Physics. 1982;21(6):467–488.
[32] Wolfram S. Theory and Applications of Cellular Automata. Singapore: World Scientific; 1986.
[33] Mardiris V, Sirakoulis GC, Mizas C, Karafyllidis I, Thanailakis A. A CAD System for Modeling and Simulation of Computer
Networks Using Cellular Automata. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 2008;38(2):253–264.
[34] Kotoulas L, Gasteratos A, Sirakoulis GC, Georgoulas C, Andreadis I. Enhancement of Fast Acquired Disparity Maps using a 1-D
Cellular Automaton Filter. In: IASTED International Conference on Visualization, Imaging and Image Processing. Benidorm, Spain;
2005. p. 355–359.
[35] Sirakoulis GC, Karafyllidis I, Thanailakis A. A CAD system for the construction and VLSI implementation of Cellular Automata
algorithms using VHDL. Microprocessors and Microsystems. 2003;27(8):381–396.
[36] Thevenaz P, Blu T, Unser M. Interpolation revisited. IEEE Trans Med Imag. 2000;19(7):739–758.
[37] Ascia G, Catania V, Russo M. VLSI hardware architecture for complex fuzzy systems. IEEE Trans Fuzzy Syst. 1999;7(5):553–570.