Acceleration of 3D, nonlinear warping using standard video graphics
hardware: implementation and initial validation
David Levin (a), Damini Dey (c), Piotr J. Slomka (a,b,c,*)
(a) Department of Medical Biophysics, University of Western Ontario, London, Ont., Canada
(b) Department of Imaging and Medicine, #A047, AIM Program, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA
(c) Department of Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA
Received 19 April 2004; accepted 16 July 2004
Abstract
Thin Plate Spline (TPS) transformations are used in many medical imaging algorithms, but are time-prohibitive for iterative or interactive
use. We have utilized current generation consumer 3D graphics cards to accelerate the application of the TPS nonlinear transformation by
combining hardware accelerated 3D textures, vertex shaders and trilinear interpolation. Our hardware accelerated algorithm warped a 512×512×173 Computed Tomography (CT) dataset in 2.3 s, and the same study, scaled to 256×256×173, in 0.5 s, using a set of 92 landmarks.
An accelerated software implementation of the TPS transformation warped the 512×512×173 study in 32.9 s and the 256×256×173 study
in 9.5 s. Subtracted images were used for qualitative analysis of the warp and the Mutual Information (MI) and Sum of Absolute Difference
(SAD) similarity metrics were used to quantitatively compare the output of the hardware accelerated algorithm to that of the software
algorithm. The hardware accelerated algorithm provided a 7–65 times performance increase when compared to the software algorithm, and
produced output of comparable quality (MI > 300 and SAD < 1.75%).
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Thin plate spline; Nonlinear; Warping; 3D; Hardware Acceleration
1. Introduction
The ability to compare multiple medical imaging scans,
acquired from single or multiple modalities, allows a
physician to review a combination of anatomical and
functional information simultaneously. It also allows them
to estimate the differences between two images more easily.
This can further increase their ability to make an accurate
diagnosis. A major difficulty arises when scans are acquired
at different times. Misalignment of the two images can make
their comparison too inaccurate to be of use. Volume-based
image registration algorithms have been investigated as a
means to correct for these misalignments, making direct
comparisons of two such scans possible [1].
0895-6111/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compmedimag.2004.07.005
* Corresponding author. Address: Department of Imaging #A047, AIM
Program, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles,
CA 90048, USA. Tel.: +1 310 423 4348; fax: +1 310 412 0173.
E-mail address: [email protected] (P. J. Slomka).
There are two types of image registration algorithm,
linear and nonlinear. Due to the flexible nature of soft tissue,
organs such as the heart and the lungs tend to deform
nonlinearly rather than just be misaligned [2].
Compensating for these deformations requires a nonlinear
approach to registration [2]. The Thin Plate Spline (TPS)
algorithm is a common nonlinear transformation used to
solve such problems. Three dimensional TPS has been
successfully used to register lung studies [2], prostate and
pelvic Magnetic Resonance Imaging (MRI) volumes [3] and
contrast and non-contrast Digital Subtraction Angiography
(DSA) images [4]. Unfortunately, the long runtimes of
current TPS algorithms are prohibitive for iterative or
interactive use [2,3,5].
Computerized Medical Imaging and Graphics 28 (2004) 471–483
www.elsevier.com/locate/compmedimag
Although software optimization and approximation can
offer some increase in speed, an alternative avenue for
accelerating nonlinear warping is consumer graphics hardware.
Driven by the increase in popularity and complexity
of interactive video games, several companies have released
computer graphics cards which accelerate many 3D
operations such as trilinear interpolation and texture
mapping [6,7]. Experiments performed by the Interactive
Visualization Group at the University of Erlangen have
shown that graphics hardware can be used to accelerate
nonlinear warping of a volume set by treating the nonlinear
transformation as a series of piecewise linear transformations
[8,9]. By exploiting hardware-based 3D texture
mapping and trilinear interpolation, order of magnitude
time enhancements are possible [8].
The newest development in consumer graphics chips, or
Visual Processing Units (VPUs), is the ability to customize
the 3D rendering pipeline [10]. Prior to the current
generation of graphics processors, the rendering pipeline
was a fixed structure, accepting vertices and textures as data
and then acting upon that data using a set of predetermined
operations. Extending the pipeline with custom operations
was impossible. New advances in consumer graphics
hardware, such as shaders, have removed this limitation.
Shaders are custom assembly programs which are executed
as data moves through the rendering pipeline [11,12].
Combinations of shaders can be used to create a custom
rendering pipeline that performs non-standard operations on
vertices and pixels as they move through the rasterization
process. Shaders have recently been introduced to consumer
level graphics cards by ATI and Nvidia [11] and are
supported by the DirectX [13] and OpenGL [14] application
programming interfaces (APIs).
In this work we have exploited the programmability of
this graphics hardware to accelerate a nonlinear warping
algorithm that approximates the TPS transformation. We
extended the previous research in this area [8,9] by
comparing this warping algorithm to an accepted software
implementation of the TPS transformation in terms of speed
and image quality. This was done in order to provide initial
validation for its use in medical imaging applications.
Furthermore, all the presented results were obtained on
inexpensive consumer level graphics cards. Hardware
accelerated warping could be used in image registration
algorithms that require iterative warping of volume data. It
could also be used to allow interactive, nonlinear warping in
a clinical setting.
2. Methods
The hardware accelerated TPS algorithm can be divided
into two distinct sections: the calculation of the TPS
transformation and the application of this transformation to
volume data.
2.1. Algebraic solution to the TPS in three dimensions
The algebraic solution to the TPS transformation in three
dimensions is based on the solution in two dimensions [15].
The TPS transformation requires the definition of an
arbitrary number of landmark pairs. Each pair of landmarks
consists of a source landmark and a target landmark.
The mapping of the source landmarks to the target
landmarks is described by the following equation
$$(x_t, y_t, z_t) = f(x_s, y_s, z_s) \tag{1}$$
where
$(x_t, y_t, z_t)$ are the target landmark coordinates in Cartesian space.
$(x_s, y_s, z_s)$ are the source landmark coordinates in Cartesian space.
$f(x, y, z)$ is a function that maps the source landmark onto the target landmark.
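The set of linear equations that constitutes the transformation
f(x, y, z) is comprised of five individual matrices.
The matrix P is a matrix of the source landmarks and the
matrix T is a matrix of the target landmarks:
$$
P = \begin{bmatrix}
1 & x_{s0} & y_{s0} & z_{s0} \\
1 & x_{s1} & y_{s1} & z_{s1} \\
1 & x_{s2} & y_{s2} & z_{s2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{sN} & y_{sN} & z_{sN}
\end{bmatrix},
\qquad
T = \begin{bmatrix}
x_{t0} & y_{t0} & z_{t0} \\
x_{t1} & y_{t1} & z_{t1} \\
x_{t2} & y_{t2} & z_{t2} \\
\vdots & \vdots & \vdots \\
x_{tN} & y_{tN} & z_{tN}
\end{bmatrix}
\tag{2}
$$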
The transpose matrix $P^T$ is also required. The matrix K is
defined in terms of the distances between the source
landmarks
$$
K = \begin{bmatrix}
U(r_{00}) & U(r_{01}) & U(r_{02}) & \cdots & U(r_{0N}) \\
U(r_{10}) & \cdots & \cdots & \cdots & U(r_{1N}) \\
U(r_{20}) & \cdots & \cdots & \cdots & U(r_{2N}) \\
\vdots & \cdots & \cdots & \cdots & \vdots \\
U(r_{N0}) & U(r_{N1}) & U(r_{N2}) & \cdots & U(r_{NN})
\end{bmatrix}
\tag{3}
$$
where
$r_{ab} = |(x_{sa}, y_{sa}, z_{sa}) - (x_{sb}, y_{sb}, z_{sb})|$ is the Euclidean
distance between two source landmarks.
$U(r)$ is the radial basis function used for the TPS
calculation. In 3D, $U(r) = r$ [16].
Finally, we define a 4×4 zero matrix, O. These matrices
can be combined into a single, larger matrix L:
$$
L = \begin{bmatrix} K & P \\ P^T & O \end{bmatrix}
\tag{4}
$$
The TPS parameters are computed by evaluating the
following matrix equation:
$$L^{-1} T = V \tag{5}$$
We solved this system of linear equations using Gaussian
Elimination [17]. The resulting matrix V has the following
form:
$$
V = \begin{bmatrix}
x_{w0} & y_{w0} & z_{w0} \\
x_{w1} & y_{w1} & z_{w1} \\
x_{w2} & y_{w2} & z_{w2} \\
\vdots & \vdots & \vdots \\
x_{wN} & y_{wN} & z_{wN} \\
t_x & t_y & t_z \\
s_{xx} & s_{xy} & s_{xz} \\
s_{yx} & s_{yy} & s_{yz} \\
s_{zx} & s_{zy} & s_{zz}
\end{bmatrix}
\tag{6}
$$
The first N rows of this matrix specify a set of 3D weights
that are used in the equation f(x, y, z) (Eq. (1)). The bottom
4×3 section of Eq. (6) contains the parameters of a linear
transformation that can be expressed in the following form:
$$
A = \begin{bmatrix}
s_{xx} & s_{yx} & s_{zx} & t_x \\
s_{xy} & s_{yy} & s_{zy} & t_y \\
s_{xz} & s_{yz} & s_{zz} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{7}
$$
Subsequently, we can find the transformed position (x′,
y′, z′) of each point (x, y, z) by evaluating Eq. (1) where:
$$
f(x, y, z) = A \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
+ \sum_{j=0}^{N} \Big( (x_{wj}, y_{wj}, z_{wj}) \, \big| (x_{sj}, y_{sj}, z_{sj}) - (x, y, z) \big| \Big)
\tag{8}
$$
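To make Eqs. (2)–(8) concrete, the following Python sketch assembles the matrix L, solves for V by Gaussian elimination and evaluates f(x, y, z) at a point. It is an illustration only, not the implementation used in this work: the function names (`solve_linear`, `tps_fit`, `tps_apply`) are hypothetical, the target matrix T is assumed to be padded with four zero rows so that its height matches L, and the landmark coordinates are made-up values.

```python
import math

def solve_linear(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        # Pivoting is required because the diagonal of K is zero (U(0) = 0).
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (coplanar or repeated landmarks?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def tps_fit(sources, targets):
    """Build L (Eq. 4) and return (weights, affine) taken from V (Eq. 6)."""
    n = len(sources)
    size = n + 4
    L = [[0.0] * size for _ in range(size)]
    for a in range(n):
        for b in range(n):
            L[a][b] = math.dist(sources[a], sources[b])  # K with U(r) = r
        row_p = (1.0,) + tuple(sources[a])
        for c in range(4):
            L[a][n + c] = row_p[c]  # block P
            L[n + c][a] = row_p[c]  # block P^T
    # Assumption: T is padded with four zero rows to match the height of L.
    rhs = [list(t) for t in targets] + [[0.0] * 3 for _ in range(4)]
    V = solve_linear(L, rhs)
    return V[:n], V[n:]  # landmark weights; rows (t, s_x., s_y., s_z.)

def tps_apply(point, sources, weights, affine):
    """Evaluate Eq. (8): linear part plus weighted distances to the landmarks."""
    x, y, z = point
    out = [affine[0][k] + x * affine[1][k] + y * affine[2][k] + z * affine[3][k]
           for k in range(3)]
    for s, w in zip(sources, weights):
        r = math.dist(s, point)
        for k in range(3):
            out[k] += w[k] * r
    return tuple(out)

# Made-up, non-coplanar landmark pairs used purely for illustration.
sources = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
targets = [(0, 0, 0), (1.1, 0, 0), (0, 0.9, 0.1), (0, 0, 1.2), (1, 1.1, 0.9)]
weights, affine = tps_fit(sources, targets)
```

Because the TPS interpolates its landmarks, applying `tps_apply` to any source landmark should return the corresponding target landmark up to rounding error, which is a convenient sanity check for an implementation.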
2.2. TPS warping of volume data
The initial step in applying the TPS transformation to
volume data is to calculate the TPS linear transformation
and the nonlinear weights (Eq. (5)). There are two possible
manners in which the TPS transformation can be applied to
a volume. The first is by evaluating Eq. (8) for each voxel in
the source volume. This will compute its position in the
warped volume. Interpolation (nearest neighbour, trilinear
or tricubic) is used to calculate the intensities of voxels for
arbitrary, new voxel locations in the warped volume.
The second method is known as a grid transform [16]. A
grid transform is represented by a regular, 3D grid of
displacement vectors. Interpolation is used to transform
points that do not lie directly on the grid. This increases the
efficiency of the TPS transformation because Eq. (8) is only
evaluated at each grid point, not at each voxel. This means
the accuracy of the warp becomes dependent on the spacing
of the grid, and is therefore, inversely proportional to its
speed. In this study we used grid spacing ranging from 2×2
voxels to 128×128 voxels. We utilized trilinear
interpolation with both methods of applying the TPS
transformation.
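The grid transform idea can be sketched in a few lines of Python: displacement vectors are stored only at the grid points, and trilinear interpolation supplies the displacement everywhere else. This is a simplified illustration rather than the VTK implementation; the helper names `build_grid` and `trilinear` are hypothetical, and a single isotropic spacing is assumed for all three axes.

```python
def build_grid(displacement, spacing, counts):
    """Sample a displacement function at the points of a regular 3D grid."""
    nx, ny, nz = counts
    return [[[displacement((i * spacing, j * spacing, k * spacing))
              for k in range(nz)]
             for j in range(ny)]
            for i in range(nx)]

def trilinear(grid, spacing, point):
    """Interpolate the gridded displacement vectors at an arbitrary point.

    grid[i][j][k] holds the displacement at (i, j, k) * spacing.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    idx, frac = [], []
    for coord, n in zip(point, dims):
        g = coord / spacing
        i = min(max(int(g), 0), n - 2)  # clamp so that i + 1 is a valid index
        idx.append(i)
        frac.append(g - i)
    i, j, k = idx
    fx, fy, fz = frac
    out = [0.0, 0.0, 0.0]
    # Weighted sum over the eight grid points surrounding the query point.
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            for dk, wz in ((0, 1.0 - fz), (1, fz)):
                w = wx * wy * wz
                d = grid[i + di][j + dj][k + dk]
                for c in range(3):
                    out[c] += w * d[c]
    return tuple(out)
```

In a TPS setting, `displacement` would evaluate Eq. (8) minus the input position at each grid point, so the expensive sum over landmarks is paid once per grid point instead of once per voxel. The interpolation error grows with the spacing, which is the speed versus accuracy trade-off described above.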
2.3. Hardware acceleration of the TPS
In order to accelerate the TPS transformation using
graphics hardware we have implemented a modified grid
transform algorithm. This algorithm uses the graphics card to
warp 3D texture coordinates at discrete vertices, and uses
trilinear interpolation to determine the intensities of voxels
between the warped points.
A texture is a pattern that is mapped to a polygon via the
process of texture mapping [18]. Common dimensions for
textures are 1D, 2D and 3D. A 1D texture is a line of color.
A 2D texture is an image. A 3D texture is a stack of images
and is similar to a volume of medical imaging data. In order
to map a texture to a polygon one must define texture
coordinates at each vertex of that polygon. Texture
coordinates are interpolated across the face of the polygon,
applying the texture smoothly to the entire surface of the
shape. The texture coordinates of each vertex must be
specified in the same number of dimensions as the texture
itself.
Our implementation utilizes the OpenGL 1.5 API [14]
and was written in C++ using Microsoft Visual
Studio .NET. The hardware accelerated TPS algorithm
requires two OpenGL extensions and one WGL extension
[19]. WGL is the Microsoft Windows, platform-specific
extension to OpenGL. However, since WGL functions have
counterparts on other operating systems our implementation
is not platform specific. This algorithm requires support for
the EXT_texture3D and ARB_vertex_program OpenGL
extensions and the WGL_pixel_format WGL extension.
The EXT_texture3D extension adds 3D texture support to
OpenGL. The ARB_vertex_program extension adds shader
capabilities to OpenGL. This allows the user to replace
OpenGL’s standard, fixed function transformation pipeline
with customized assembly programs. These programs
perform custom operations on each vertex as it is rendered.
In addition, the WGL_pixel_format extension adds Pixel
Buffer functionality to OpenGL. Pixel Buffers are hardware
accelerated, off-screen buffers used to increase the dynamic
texturing speed.
2.3.1. Algorithm details
Initially, the volume data is loaded into the graphics
card’s texture memory as a 3D texture. This is done using the
glTexImage3D() function. OpenGL 3D textures have
several constraints. Firstly, every dimension of a 3D texture
must be a power of two. To accommodate this, each
dimension of the volume is expanded to the nearest power of
two by padding it with zero intensity voxels. Secondly,
current consumer video cards support a maximum of 32-bit
color. This provides an 8-bit range for each red, blue, green
and alpha component of a voxel. In order to reduce texture
memory usage, the volume data is loaded as a luminance
texture, requiring only a single 8-bit channel per voxel.
However, most medical data is stored as 16 bits per voxel;
therefore, the data must be scaled to 8 bits (256 intensity
levels) when it is loaded as a texture. Having loaded the
volume data as a texture, a Pixel Buffer with dimensions
equal to the x and y dimensions of the volume is created. In
order to warp a dataset we calculate the TPS transformation;
using Eq. (5), we can calculate the linear transformation for
the TPS, as well as the weights for each landmark.
Fig. 1. A diagram of the hardware accelerated TPS algorithm illustrating which sections of the algorithm are executed in hardware and which sections are
executed in software. The algorithm begins by using the source and target landmark sets to calculate the TPS transformation in software. The linear
transformation and the nonlinear weights are passed to an OpenGL vertex program. Using a previously prepared vertex grid, this vertex program applies the
warp on a slice-by-slice basis by rendering each slice of the volume to an OpenGL Pixel Buffer and then copying the warped slice to the final warped 3D texture
for display.
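The two texture constraints described in this subsection (power-of-two dimensions and a single 8-bit luminance channel) can be illustrated with a short Python sketch. The helper names are hypothetical, and the linear min–max mapping used for the 16-bit to 8-bit conversion is an assumption; the exact scaling function is not specified here.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def padded_dims(dims):
    """Texture dimensions after zero-padding each axis to a power of two."""
    return tuple(next_power_of_two(d) for d in dims)

def scale_to_8bit(values):
    """Map intensities linearly onto 0..255 (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    return [round((v - lo) * 255 / (hi - lo)) for v in values]

print(padded_dims((512, 512, 173)))  # (512, 512, 256)
```

For the 512×512×173 study only the z axis needs padding, so roughly a third of the resulting texture holds zero-intensity voxels.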
Next, we create an ordered grid, with arbitrary grid
spacing, for each slice in the volume. For each grid vertex,
the second sum term in Eq. (8) is calculated in software
(Fig. 1). This sum, along with the corresponding grid vertex
and linear TPS transformation (matrix A), is sent to the
OpenGL vertex program (Fig. 1). The program computes
warped texture coordinates for each vertex in a slice, as that
slice is drawn to the Pixel Buffer (Fig. 2A). When a volume
slice is drawn in this manner, the intensities of the pixels
between the warped grid vertices are determined using
hardware-based trilinear interpolation. The warped slice can
then be read back into the OpenGL texture memory using
the glCopyTexSubImage3D() function. This process is
repeated for every slice in the volume to produce a warped
dataset.
Fig. 2. A diagram illustrating the calculation of new texture coordinates for a grid of vertices (A) and the difference between normal and explicit triangulation (B).
Table 1
A table showing the dimensions and pixel sizes of the two studies used to
validate the HW algorithm

             Dimensions (voxels)      Pixel size (mm)
             X      Y      Z          X      Y      Z
Study 1      512    512    173        0.6    0.6    1.25
Study 2      512    512    123        0.4    0.4    1.5
Study 1^a    256    256    173        1.2    1.2    1.25
Study 2^a    256    256    123        0.8    0.8    1.5

a Denotes studies scaled to 256×256 matrix size.
Due to the nature of hardware trilinear interpolation we
need to explicitly triangulate our vertex grid [8]. Triangles
are the basic 3D primitives of consumer computer graphics
hardware. If a grid square is drawn using only four vertices,
the graphics card will automatically divide it into two
triangles (Fig. 2B). The nature of this division will lead to
the incorrect interpolation of the pixels inside the square.
We divide each grid square into four separate triangles by
using a central vertex. This is called explicit triangulation
(Fig. 2B) and it is necessary to ensure correct trilinear
interpolation in hardware [8,9].
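As an illustration of explicit triangulation, the Python sketch below generates the vertices and triangles for one slice: every grid square receives an extra central vertex and is split into four triangles that share it. The function name and the 2D vertex layout are assumptions made for this example; in the actual algorithm each vertex would also carry 3D texture coordinates.

```python
def explicit_triangulation(nx, ny, spacing):
    """Triangulate an nx-by-ny grid of squares with one centre vertex each.

    Returns (vertices, triangles); triangles index into the vertex list.
    """
    vertices = []
    corner = {}
    # Regular grid corners: (nx + 1) * (ny + 1) vertices.
    for j in range(ny + 1):
        for i in range(nx + 1):
            corner[(i, j)] = len(vertices)
            vertices.append((i * spacing, j * spacing))
    triangles = []
    for j in range(ny):
        for i in range(nx):
            # One additional vertex at the centre of every grid square.
            centre = len(vertices)
            vertices.append(((i + 0.5) * spacing, (j + 0.5) * spacing))
            a = corner[(i, j)]
            b = corner[(i + 1, j)]
            c = corner[(i + 1, j + 1)]
            d = corner[(i, j + 1)]
            # Four triangles fan out from the centre to the square's edges.
            triangles += [(a, b, centre), (b, c, centre),
                          (c, d, centre), (d, a, centre)]
    return vertices, triangles

vertices, triangles = explicit_triangulation(2, 2, 16.0)
print(len(vertices), len(triangles))  # 13 16
```

A grid of n×n squares therefore has (n+1)² corner vertices plus n² centre vertices, so the explicitly triangulated grid always contains more warped points than a plain grid at the same spacing.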
2.4. Validation of hardware TPS
In order to determine the usefulness of the hardware TPS
algorithm (HW) we compared it to the Visualization Toolkit
(VTK) version 4.2 [20] software based TPS algorithm (SW),
and the VTK software grid transform algorithm (SWG) [16]
in terms of speed and quality. Validation was done using
two test studies (Table 1). Both test studies were standard
DICOM datasets.
The first study was a lung dataset containing two chest
CT volumes. The first volume was recorded during normal
breathing and exhibited significant motion artifacts in the
area of the chest and diaphragm. The second volume was of
identical dimensions and pixel sizes as the first, but was
acquired during inspiration and exhibited no motion
artifacts. The first volume was warped to the second volume
using a set of 92 manually selected landmarks (Fig. 3B).
The second study was a cardiac dataset containing two
CT volumes. The first volume was a CT acquired using an
intravenous contrast agent. The second volume of this study
was a non-contrast CT, taken during a different breath-hold,
with identical pixel sizes and dimensions. A set of 28
landmarks (Fig. 3A) was manually selected in order to warp
the non-contrast CT to the contrast CT. Warped volumes
were created using all three TPS algorithms and compared.
Fig. 3. The landmark sets for the cardiac (A) and lung (B) test studies shown in 3D. Light grey (green in web version) indicates source landmarks and dark grey
(red in web version) indicates target landmarks.
Two different hardware configurations were used for
benchmarking (Table 2). Results for which no system is
specified were performed using System 1 only.
2.4.1. Quality comparison of hardware and software
based TPS
The quality of the HW algorithm was examined in
two ways. First, both test studies were warped using each
TPS algorithm. The HW algorithm and the SWG
algorithm were applied using seven different grid
spacings: 2×2, 4×4, 8×8, 16×16, 32×32, 64×64
and 128×128 voxels. The resulting volumes were
compared to those produced by the SW algorithm
using the Mutual Information (MI) and Sum of Absolute
Difference (SAD) cost functions. Secondly, the warped
volumes were compared to their respective target
volumes using the same two cost functions. In our
study we did not adjust the grid spacing in the z
direction because, for our test studies, the pixel spacing
in z was much larger than the pixel spacing in the x and
y directions (Table 1).
Two additional experiments were performed to determine
the effects of different parameters on warping quality
and speed. First, Study 1 was scaled to 256 intensity values
and then warped using the SW algorithm. The warped study
was compared to the output of the HW algorithm in order to
estimate the effect of intensity scaling on the similarity of
the warped volumes.
Secondly, to assess the effect of matrix size on the quality
and the speed of the HW algorithm, both test studies were
scaled to 256×256 in the x and y directions (Table 1).
Table 2
The two benchmarking system configurations used for this paper

            Processor             RAM       Graphics card                             Fill rate (MTexels/s)
System 1    AMD Athlon 1.2 GHz    512 MB    ATI Radeon 9700 Pro, 128 MB DDR RAM       1482.4^a
System 2    Intel Xeon 2.0 GHz    1 GB      Nvidia GeForce FX 5600, 256 MB DDR RAM    995.0^b

a Ref. [27].
b Ref. [28].
All three warping algorithms were applied and the speed
and similarity of the results were compared.
2.4.1.1. Sum of absolute difference. SAD provides an
intuitive way to compare two volumes on a voxel by
voxel basis [21,22]. We calculated SAD in the following
manner
$$
\mathrm{SAD} = \frac{\sum_{z=1}^{Z} \sum_{y=1}^{Y} \sum_{x=1}^{X} \left| V_2(x, y, z) - V_1(x, y, z) \right|}{T(V_1)} \cdot 100\%
\tag{9}
$$
where
X is the x dimension of the volume.
Y is the y dimension of the volume.
Z is the z dimension of the volume.
VA(x, y, z) is the intensity of the voxel at position (x, y, z)
in volume A.
T(VA) is the total intensity count of volume A.
Volumes that are closely aligned exhibit small SAD
values.
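Eq. (9) translates directly into code. The Python sketch below is a minimal illustration in which volumes are stored as nested lists indexed [z][y][x]; the function name `sad_percent` and the sample intensities are hypothetical.

```python
def sad_percent(v1, v2):
    """Sum of absolute differences between v2 and v1, as a percentage of T(v1)."""
    total_diff = 0
    total_v1 = 0
    for slice1, slice2 in zip(v1, v2):
        for row1, row2 in zip(slice1, slice2):
            for a, b in zip(row1, row2):
                total_diff += abs(b - a)  # numerator of Eq. (9)
                total_v1 += a             # T(V1): total intensity of volume 1
    if total_v1 == 0:
        raise ValueError("reference volume has zero total intensity")
    return 100.0 * total_diff / total_v1

# Two tiny 1x2x2 volumes with made-up intensities.
v1 = [[[10, 20], [30, 40]]]
v2 = [[[10, 22], [27, 40]]]
print(sad_percent(v1, v2))  # 5.0
```

Note that the measure is normalised by the total intensity of the first volume only, so swapping the arguments generally changes the result.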
2.4.1.2. Mutual information. The Mutual Information is a
similarity measure based on information theory [23,24].
The Mutual Information measure has been used in voxel-based
registration to find the transformation that best aligns two
datasets. This is done by maximizing the Mutual Information
of the two volumes. Mutual Information is defined as
$$M(u, v) = \left( H(u) + H(v) - H(u, v) \right) \cdot 100 \tag{10}$$
where
$$H(w) = -\int f(w) \log(f(w)) \, dw \tag{11}$$
f(w) is the probability density function of a random variable
w, and
$$H(u, v) = -\iint f(u, v) \log(f(u, v)) \, du \, dv$$
is the joint entropy of u and v, computed from their joint histogram.
Fig. 4. The output of the HW algorithm for both test studies. Identical slices are shown for the source, warped and target volumes. Coronal slices are shown for
Study 1 and transverse slices are shown for Study 2. The grid spacing for these warps was 16×16 voxels.
Table 3
The calculated similarity between volumes warped using the hardware
(HW) and software (SWG) algorithms, and their corresponding target
volumes

            Hardware             Software
            MI       SAD (%)     MI       SAD (%)
Study 1     174.2    11.8        174.1    11.9
Study 2     52.4     52.7        52.8     52.9

2.4.2. Speed comparison of hardware and software
based TPS
The performance of each TPS algorithm (SW, SWG, and
HW) was measured as a function of the number of
landmarks used to calculate the TPS transformation. For
the HW algorithm and the SWG algorithm, performance
was also measured as a function of the grid spacing.
In order to determine the effect of landmark quantity on
warping speed, both test studies were warped using each
TPS algorithm. Landmark lists were expanded to contain up
to 165 landmarks by adding new landmarks extrapolated
from the originals. This was done by adding random
displacement factors of between 0.5 and 2.0 mm in each
dimension, to each original landmark. The HW algorithm
and the SWG algorithm were applied with a fixed grid
spacing of 16×16 voxels for this test.
In order to test the effect of grid spacing on the speeds of
the HW and SWG algorithms, each test study was warped
using its corresponding landmark list. A series of warps was
performed using grid spacings ranging from 2×2 voxels to
128×128 voxels and the duration of each warp was
recorded.
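The landmark expansion used for the speed tests can be sketched as follows. This Python example is only one plausible reading of the procedure: the function name is hypothetical, the sign of each displacement is assumed to be random, and the sketch perturbs a single list of points, whereas the study used source and target landmark pairs.

```python
import random

def expand_landmarks(landmarks, total, rng, lo=0.5, hi=2.0):
    """Grow a landmark list to `total` points by jittering the originals.

    Each new point is an original landmark displaced by a random amount
    between lo and hi (mm) along every axis; the sign is an assumption.
    """
    expanded = list(landmarks)
    i = 0
    while len(expanded) < total:
        base = landmarks[i % len(landmarks)]
        offset = tuple(rng.choice((-1.0, 1.0)) * rng.uniform(lo, hi)
                       for _ in range(3))
        expanded.append(tuple(b + o for b, o in zip(base, offset)))
        i += 1
    return expanded

# Synthetic stand-in for the 92 manually selected lung landmarks.
originals = [(float(i), float(2 * i), float(3 * i)) for i in range(92)]
expanded = expand_landmarks(originals, 165, random.Random(0))
print(len(expanded))  # 165
```

Keeping the displacement at or above 0.5 mm avoids near-coincident landmarks, which would otherwise make the matrix L of Eq. (4) close to singular.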
3. Results
Fig. 4 shows the results of warping both test studies using
the HW algorithm. A target image is shown for comparison
since landmarks were chosen to produce a warped volume
that approximated the target volume. After warping, the MI
between the warped and target volumes was calculated to be
174.2 for Study 1 and 52.4 for Study 2 (Table 3).
The calculated SAD values were 11.8% for Study 1 and
52.7% for Study 2 (Table 3). The similarity metrics were
also computed between volumes warped using the SWG
algorithm and their respective target volumes. These values
are shown in Table 3 for comparison with the HW values.
When Study 1 was warped using both the SW and the
SWG algorithms, a wedge shaped artifact (Fig. 5) was
observed. In order to prevent this artifact from affecting our
results it was removed by clipping 7 voxels from the borders
of all Study 1 volumes. This was not necessary for the Study
2 volumes because no artifacts were created by the software
algorithms.
Fig. 5. A figure illustrating the artifact created by the software (SW and SWG) algorithms when warping Study 1. The white, dashed box highlights the artifact.
3.1. Quality comparison of hardware and software
based TPS
Visually comparing volumes warped using the HW
algorithm to those warped using the SW and SWG
algorithms showed almost identical results, even when
viewed under magnification (Fig. 6). However, subtracted
images illustrate the specific differences between the
outputs of the various algorithms. The HW algorithm
subtracted image shows a low intensity background noise
while the SWG image does not exhibit this effect (Fig. 6).
The similarity between the HW algorithm and the SW
algorithm decreased as the grid spacing of the HW
algorithm was increased (Fig. 7). However, the similarity
between the two warping techniques reached 99% of its
maximum MI value at a grid spacing of 16×16 voxels
(Fig. 7). The correspondence between the SWG algorithm
and the SW algorithm also varied inversely with the grid
spacing of the SWG algorithm. Though the correspondence
continued to increase as the grid spacing was decreased,
95% of its maximum MI was reached at a grid spacing of
16×16 voxels (Fig. 7). Based on comparison with the SW
algorithm, the HW algorithm was slightly more accurate
than the SWG algorithm at larger grid spacing (Fig. 7).
Table 4 shows the calculated similarities between the HW
and SW algorithms as well as the calculated similarities
between the SWG and SW algorithms.
Fig. 6. Images comparing the results of the SWG algorithm and the HW algorithm to the results of the SW algorithm. The second row of images shows a
magnified view of the area within the white, dashed square. The third row shows images created by subtracting each volume from the SW warped volume. The
fourth row shows a magnified view of the subtracted images.
Fig. 7. The calculated similarity between volumes warped using the
software (SWG) and hardware (HW) algorithms and volumes warped using
the SW algorithm, as a function of grid spacing. The two similarity metrics
computed were Mutual Information (higher is better) and SAD (lower is
better).
The similarity between volumes warped using the SWG
and HW algorithms and the target volumes from both test
studies showed little change as grid spacing was increased.
This is evidenced by the flat lines on both the SAD and MI
graphs (Fig. 8).
The effects of intensity scaling on the quality of the HW
algorithm are shown in Table 5. Scaling the source volume
to 8-bit before applying the SW algorithm did not cause the
output of the SW algorithm to match that of the HW
algorithm more closely.
Table 4
The similarity of the software (SWG) and hardware (HW) warped studies
when compared to the SW warped studies

            Study 1                           Study 2
            Hardware        Software          Hardware        Software
Matrix      512     256     512     256       512     256     512     256
MI          323.7   324.3   335.9   316.7     312.0   301.9   342.4   316.7
SAD (%)     1.64    1.71    0.22    0.92      1.33    1.72    0.17    0.81

A grid spacing of 16×16 voxels was used for the software and hardware
warps. Values for volumes with 512×512 matrix size and volumes with
256×256 matrix size are shown.
Table 5
The similarity of HW warped volumes to 8-bit and 16-bit volumes warped
using the SW algorithm

            Study 1            Study 2
            8-bit    16-bit    8-bit    16-bit
MI          322.8    323.7     309.6    312.0
SAD (%)     3.71     1.64      1.51     1.33
Fig. 8. The calculated similarity between volumes warped using the
software (SWG) and the hardware (HW) algorithms and their respective
target volumes as a function of grid spacing. The similarity metrics
computed were Mutual Information (higher is better) and SAD (lower is
better).
The quantitative results of warping the 256×256
volumes with the HW and SWG algorithms are shown in
Table 4. The MI and SAD values calculated between the
results of these algorithms and the results of the SW
algorithm were still much higher than the MI and SAD
values calculated between the warped 512×512 volumes
and their corresponding target volumes (Table 3).
Table 6
The warping time, in seconds, of the HW, SWG and SW algorithms for two test
studies, on two different sets of hardware

            Study 1                                       Study 2
            HW           SWG           SW                 HW           SWG           SW
Matrix      512    256   512    256    512     256        512    256   512    256    512    256
System 1    2.3    0.5   32.9   9.5    292.7   77.2       0.6    0.2   23.5   7.0    77.6   19.5
System 2    2.6    0.7   22.8   5.8    138.2   32.4       1.6    0.4   15.7   4.0    33.5   8.4

Times are provided for volumes with 512×512 matrix size and volumes with 256×256
matrix size. The grid spacing selected for the HW and SWG warps was
16×16 voxels.
3.2. Speed comparison of hardware and software based TPS
The HW algorithm was substantially (between 7 and 65
times) faster than the SWG algorithm (Table 6, Fig. 9).
The speed differential depended on the number of landmarks
in the warp and the system configuration used to run
both algorithms. The HW algorithm was between 85 and 87
times faster than the SW algorithm. The runtimes of each of
the algorithms exhibited a linear dependency on the number
of landmarks, though for both the SWG algorithm and the
HW algorithm this relationship was fairly flat (Fig. 9).
The exact nature of this relationship was determined by
fitting a line to the data. The HW algorithm showed a
computing time increase of 8–12 ms per landmark. The
SWG algorithm showed an increase of 16–19 ms per
landmark. The software algorithms showed improved
performance on System 2 due to the increased processor
speed but the HW algorithm suffered an overall decrease in
performance due to the slower graphics card of System 2.
The fill rate shows the number of textured pixels (texels) the
graphics card can draw per second (Table 2) and is
therefore a fitting indicator of performance in this
application.
The runtimes of both the SWG and HW algorithms
increased exponentially as grid spacing was decreased
(Fig. 10). This effect was more pronounced in the HW
algorithm. The faster processor of System 2 helped alleviate
this effect (Fig. 10).
A higher warping speed was obtained by decreasing the
matrix size of the source volumes (Table 6).
Fig. 9. Runtimes of all three warping algorithms as a function of landmark
number, when applied to Study 1. The results for Study 2 were
proportionally lower than those for Study 1. The SWG algorithm and the
HW algorithm were applied with a grid spacing of 16×16 voxels. The top
graph provides a detailed comparison of the warping times of the HW
algorithm and the SWG algorithm. The bottom graph includes the SW
runtime.
Fig. 10. The effect of grid spacing on the warping time of the software
(SWG) and hardware (HW) algorithms, when applied to Study 1. The
results for Study 2 were proportionally lower than those for Study 1.
4. Discussion
4.1. Quality comparison of hardware and software TPS
Several interesting observations can be made from the
quality measurements performed on the HW algorithm.
When the HW and the SWG algorithms were compared to
the SW algorithm, it became evident that at lower grid
spacing the SWG algorithm was a slightly closer match to
the SW algorithm (Fig. 7). However, as grid spacing was
increased (approximately ≥64×64), the HW algorithm started to better
approximate the SW algorithm (Fig. 7). This is due to the
explicit triangulation of the HW grid. The added midpoint
vertex of each grid square leads to more warped points at
each grid spacing. This explains the higher accuracy of
the HW algorithm at higher grid spacing. At lower grid
spacing this advantage is not as important because the size
of each grid square is decreased and the effect of the
midpoint vertex becomes negligible.
Both the HW and SWG algorithms reached a point at
which decreasing the grid spacing no longer had a large
impact on the MI or the SAD calculations (Fig. 7).
Decreasing the grid spacing below 16!16 voxels showed
a negligible increase in the correspondence between the SW
algorithm and both the HW and the SWG algorithms.
Visual comparison did not reveal any significant
differences between the outputs of the HW algorithm and
the target volumes (Fig. 4). Furthermore, the HW algorithm
produced results that were nearly indistinguishable from
those produced by both the SW, and the SWG algorithms
(Fig. 6). The subtracted images revealed subtle differences
in the output of all three warping algorithms, but showed
that they all produced the same overall effect (Fig. 6). The
most prominent visual effect was the low intensity
background noise in the subtracted HW image (Fig. 6).
This effect may be due to differences in the software
and hardware interpolation routines or in the manner in
which the TPS transformation (Eq. 5) is evaluated. VTK
uses eigenvector decomposition to evaluate Eq. (5) and this
could return slightly different results than our own Gaussian
Elimination method.
The MI and SAD calculated between the HW algorithm
and the SW algorithm, though lower than the values
calculated between the SWG algorithm and the SW
algorithm (Table 4, Fig. 7), were still very high; especially
when compared to the MI and SAD values calculated
between the warped and target volumes (Table 3, Fig. 8).
The MI and SAD values calculated between the warped and
target volumes were virtually identical, regardless of
whether the SWG algorithm or the HW algorithm was
used to warp the source volumes. This indicates that the
observed quantitative and qualitative differences in the
output of the HW algorithm are insignificant in real world
applications.
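The two similarity measures used in this comparison can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the study: the function names, the flattened-list volume representation, the normalization of SAD by the intensity range, and the histogram binning are all assumptions, and the MI scale produced here (bits) is not the scale of the MI values reported in the tables.

```python
import math
from collections import Counter


def sad_percent(a, b, intensity_range):
    """Mean absolute voxel difference as a percentage of the intensity range.

    The normalization by intensity_range is an assumption made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("volumes must have the same number of voxels")
    total = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * total / (len(a) * intensity_range)


def mutual_information(a, b, bins, lo, hi):
    """Histogram-based mutual information (in bits) of two flattened volumes."""
    def bin_of(v):
        # Map an intensity in [lo, hi) to a histogram bin, clamping outliers.
        i = int((v - lo) / (hi - lo) * bins)
        return min(max(i, 0), bins - 1)

    n = len(a)
    joint = Counter((bin_of(x), bin_of(y)) for x, y in zip(a, b))
    marg_a, marg_b = Counter(), Counter()
    for (i, j), count in joint.items():
        marg_a[i] += count
        marg_b[j] += count

    mi = 0.0
    for (i, j), count in joint.items():
        # p_ij * log2(p_ij / (p_i * p_j)), written with raw counts.
        mi += (count / n) * math.log2(count * n / (marg_a[i] * marg_b[j]))
    return mi
```

Under these assumptions, two identical volumes give an SAD of 0% and the highest MI for a given histogram, which is why a low SAD and a high MI between the HW and SW outputs indicate close agreement.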
4.2. Speed comparison of hardware and software TPS
At grid spacings greater than or equal to 16×16 voxels,
the HW algorithm exhibited a substantial performance
advantage over both software algorithms (Table 6, Fig. 10).
As the grid spacing was decreased, the speed of the HW
algorithm approached that of the SWG algorithm but was
always faster for identical studies warped using identical
hardware. This may be because of the explicit triangulation
of the hardware grid. Due to the extra midpoint vertex in
each grid square, the hardware accelerated TPS grid
contains more vertices than the software grid transform at
equivalent grid spacing. If the software grid contains N²
points, then the hardware grid contains 2N² − 2N + 1
vertices (the N² grid points plus one midpoint for each of the
(N − 1)² grid squares). The computed sum term from Eq. (8) involves
a distance calculation that must be performed for each
landmark in the warp, at each grid point. This calculation is
still performed in software and therefore, adding more grid
vertices causes the software component of the algorithm to
dominate the computing time. Thus, the advantage of
hardware acceleration is overshadowed and processor speed
becomes the bottleneck. A faster processor helps to alleviate
this problem, as shown by the benchmark results for System
2 (Fig. 10).
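The vertex counts in this argument can be checked with a short sketch. The helper names are hypothetical, and the assumption that a grid of spacing s over a matrix of size M has M/s + 1 points per axis is ours, made only to produce concrete numbers.

```python
def points_per_axis(matrix_size, spacing):
    """Grid points along one axis (assumes points at both ends of the axis)."""
    return matrix_size // spacing + 1


def software_grid_points(n):
    """Points in an n-by-n software grid."""
    return n * n


def hardware_grid_vertices(n):
    """Grid points plus one midpoint per grid square: 2n^2 - 2n + 1."""
    return n * n + (n - 1) * (n - 1)


def distance_evaluations(vertices, landmarks):
    """Landmark-to-vertex distance calculations needed for the sum term."""
    return vertices * landmarks


n = points_per_axis(512, 16)
sw = distance_evaluations(software_grid_points(n), 92)
hw = distance_evaluations(hardware_grid_vertices(n), 92)
print(n, sw, hw)
```

Under these assumptions the hardware grid needs nearly twice as many software-side distance calculations as the software grid at the same spacing, and halving the spacing roughly quadruples both counts, which is consistent with the processor becoming the bottleneck at small grid spacings.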
Based on the comparisons of the HW warped volumes to
the SW warped volumes, we can see that decreasing the grid
spacing below 16×16 voxels has little effect on the
accuracy of the transformation (Figs. 7 and 8). This allows
us to apply the HW algorithm using a grid spacing
(≥16×16 voxels) that favors the hardware component of
the algorithm and is therefore substantially faster than the
SWG algorithm (Table 6). This suggests that 16×16 voxels
is the optimal grid spacing for studies with a 512×512
matrix size and with sub-millimeter pixel spacing in the x
and y directions.
The runtime of all the algorithms increased linearly with
the number of landmarks supplied to the warp
(Fig. 9). The SW algorithm was the slowest and most
affected by the number of landmarks used, while the HW
algorithm was substantially faster than both software
algorithms (Fig. 9). The software algorithms received an
increase in performance from the faster processor of System
2. The HW algorithm also exhibited some dependence on
processor speed. Though the total performance of the HW
algorithm on System 2 was worse than on System 1, the
increase in computing time per landmark was less (Fig. 9).
By decreasing the matrix size of the source volumes, a
speed increase was achieved (Table 6). The calculated
values of MI and SAD for these warps (Table 4) were still
much higher than the same values calculated between the
warped and target volumes of both test studies (Table 3)
indicating that a grid spacing of 16×16 voxels is still
acceptable for volumes of this matrix and pixel size.
4.3. Comparison to other research
Previous research on fast volumetric deformation has
focused on either novel algorithms [9] that have not been
tested for use in medical imaging applications, or the
application of hardware acceleration to a specific problem
area such as neurosurgery [8], without evaluation of image
quality. In neither case were the results quantitatively
validated by comparison with more accurate software
algorithms. Specifically, in Ref. [8], only piecewise linear
warping was proposed. In contrast, we utilize the TPS
nonlinear transformation. In addition, those results were
obtained using dedicated, expensive graphics workstations
and not widely available consumer graphics cards. In our
study we have developed a novel, hardware accelerated
implementation of the TPS transformation, which exploits
the latest developments in consumer graphics hardware.
We have validated our approach by using qualitative and
quantitative analysis to demonstrate that our algorithm can
produce results that are nearly identical to those of a
software TPS algorithm. Therefore, our hardware
accelerated algorithm could be used to provide a significant
increase in speed to any algorithm relying on the TPS
transformation or nonlinear warping. One potential
application is to accelerate the compensation for respiratory
motion artifacts in Digital Subtraction Angiography [4].
This idea could be extended for use with CT Angiography
(CTA) [25]. Nonlinear transformations have also been used
to align lung CT acquired during normal respiration with
high quality diagnostic CT acquired during inspiration [2]
and to assess intracranial brain shift [26]. The hardware
accelerated TPS algorithm has the potential to accelerate all
these applications.
Since the TPS transformation requires only a small set of
landmarks to cause a large change in the shape of a volume,
it is well suited for use in iterative registration algorithms.
One could continually translate each landmark until some
similarity metric was optimized (MI maximized or SAD minimized). The hardware
accelerated TPS would benefit this type of algorithm greatly
because each iteration would achieve an order of magnitude
increase in speed [5]. Therefore, more precise algorithms,
which use more iterations, could be evaluated using our
accelerated warping algorithm.
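The kind of iterative scheme described above can be sketched as a greedy search over landmark positions. This is a hypothetical illustration rather than a method evaluated in this paper: `refine_landmarks`, its parameters, and the `cost` callable are assumed names, and `cost` stands in for "warp the source volume with these landmarks, then compute a dissimilarity such as SAD or negative MI".

```python
def refine_landmarks(landmarks, cost, step=1.0, min_step=0.125, max_iters=100):
    """Greedy coordinate descent over landmark coordinates.

    Each call to cost() represents one full warp plus one metric evaluation,
    so the warp time dominates the total runtime of the search.
    """
    current = [list(p) for p in landmarks]
    best = cost(current)
    iters = 0
    while step >= min_step and iters < max_iters:
        improved = False
        for point in current:
            for axis in range(len(point)):
                for delta in (step, -step):
                    point[axis] += delta
                    trial = cost(current)
                    if trial < best:
                        best = trial
                        improved = True
                        break  # keep the move and go to the next axis
                    point[axis] -= delta  # undo a move that did not help
        if not improved:
            step /= 2.0  # refine the search once no move helps
        iters += 1
    return current, best
```

Because every candidate move costs one warp, a search over many landmarks needs a large number of warps per pass, which is where a faster warp would allow more iterations in the same time.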
4.4. Limitations
There is an important logistical consideration that arises
when hardware acceleration is used in the implementation
of any algorithm. Any such implementation becomes
dependent on the type of OpenGL graphics card used. If
this algorithm were to be used as part of a software package,
it would still need to provide redundant software algorithms
for use whenever such hardware was not available.
Examples include most laptops and lower-end desktop
machines.
Due to hardware limitations, the output of the hardware
accelerated warp contains only 256 discrete values. Though
these values are rescaled to reside between the maximum
and minimum values of the original CT, this rescaling step
introduces a reduction in the intensity resolution. This was
investigated as a cause of the differences between the HW
algorithm and both software algorithms. Our experiments
revealed that this scaling did not significantly affect the
quantitative similarity measures. Scaling the source
volumes and then applying the SW algorithm did not
produce a warped volume that was more similar to the
output of the HW algorithm (Table 5). Furthermore, since
the calculated similarity between the output of the HW
algorithm and the target volumes was nearly identical to
the similarity calculated between the output of the SWG
algorithm and the target volumes (Table 3), one can
conclude that intensity scaling does not significantly affect
the output of the HW algorithm.
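The loss of intensity resolution from an output limited to 256 discrete values can be illustrated with a small sketch. The function names and the linear min/max scaling are assumptions; the exact rescaling used in the hardware pipeline may differ.

```python
def quantize_to_8bit(values):
    """Linearly map intensities onto the integer codes 0..255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values), lo, hi
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values], lo, hi


def rescale_from_8bit(codes, lo, hi):
    """Map 8-bit codes back onto the original intensity range."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


# Hypothetical 12-bit intensity ramp covering 4096 distinct values.
original = list(range(-1000, 3096))
codes, lo, hi = quantize_to_8bit(original)
restored = rescale_from_8bit(codes, lo, hi)
max_error = max(abs(o - r) for o, r in zip(original, restored))
```

In this sketch the round trip leaves at most 256 distinct output values, and the worst-case error is about half of one quantization step, which is small relative to the full intensity range.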
One final limitation of current consumer graphics
hardware is the lack of hardware accelerated tricubic
filtering. Visual inspection shows that trilinear interpolation
produces acceptable results for image registration, but
future graphics cards may support this feature, removing
this limitation entirely.
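For reference, trilinear interpolation of a voxel grid can be written as three nested linear interpolations. This is a plain software sketch with an assumed `volume[z][y][x]` nested-list layout and clamped coordinates; it is not the graphics hardware's implementation.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def trilinear(volume, x, y, z):
    """Sample volume[z][y][x] at a fractional voxel coordinate."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    # Clamp the sample position to the volume bounds.
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    x0, y0, z0 = int(x), int(y), int(z)
    x1, y1, z1 = min(x0 + 1, nx - 1), min(y0 + 1, ny - 1), min(z0 + 1, nz - 1)
    fx, fy, fz = x - x0, y - y0, z - z0
    # Interpolate along x on the four edges of the surrounding cell.
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x1], fx)
    c10 = lerp(volume[z0][y1][x0], volume[z0][y1][x1], fx)
    c01 = lerp(volume[z1][y0][x0], volume[z1][y0][x1], fx)
    c11 = lerp(volume[z1][y1][x0], volume[z1][y1][x1], fx)
    # Then along y, and finally along z.
    return lerp(lerp(c00, c10, fy), lerp(c01, c11, fy), fz)
```

Trilinear interpolation uses only the 8 voxels surrounding the sample point, whereas tricubic filtering would use a 4×4×4 neighborhood of 64 voxels, which gives smoother results at a higher cost per sample.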
5. Conclusion
We have shown that TPS volume warping can be
significantly accelerated using the latest advancements in
consumer graphics hardware. Furthermore, we have shown,
both quantitatively and qualitatively, that the results of this
hardware accelerated TPS algorithm are of comparable
quality to those produced by two different software
algorithms. This generic method will allow dramatic
acceleration of a wide variety of medical imaging
algorithms that utilize 3D image warping. Consequently, it
could be used to implement more accurate image
registration methods that are currently time-prohibitive.
6. Summary
We have exploited the latest consumer 3D graphics
hardware in order to implement a high performance Thin
Plate Spline (TPS) nonlinear volume warping algorithm.
Accelerated nonlinear warping could be used in iterative
image registration algorithms or in interactive clinical
applications. Hardware accelerated 3D texturing and
trilinear interpolation were combined with OpenGL
vertex programs to implement a modified grid transform
algorithm. This algorithm was used to apply the TPS
transformation to volume data using the ATI Radeon
9700 Pro and Nvidia Geforce FX 5600 graphics cards.
The hardware accelerated TPS algorithm was compared
to a public domain software implementation (the
Visualization Toolkit (VTK) version 4.2) in terms of
speed and quality. Mutual Information (MI) and Sum of
Absolute Difference (SAD) were employed, along with
qualitative observation, to gauge the accuracy of the
hardware accelerated TPS algorithm. The hardware
accelerated TPS algorithm was between 85 and 87
times faster than the software TPS algorithm and 7–65
times faster than an optimized software TPS algorithm.
The hardware accelerated TPS algorithm warped a 512×512×173 lung Computed Tomography (CT) dataset in
2.3 s, compared to 32.9 s for the optimized software TPS
algorithm. Ninety-two landmarks were used to compute
the TPS transformation. For this warp, the calculated MI
between the output of the hardware accelerated algorithm
and the output of the VTK software implementation was
323.7 and the SAD was 1.64%. An MI value of 335.9
and an SAD value of 0.22% were calculated between the
outputs of the optimized software algorithm and the VTK
software implementation. The same experiment was
repeated with the dataset scaled to 256×256×173.
This yielded warping times of 0.5 s for the hardware
accelerated algorithm, versus 9.5 s for the optimized
software algorithm. The MI between the output of the
hardware accelerated algorithm and that of the VTK
implementation was 324.3 and the SAD was 1.71%. The
MI calculated between the output of the optimized
software algorithm and that of the VTK implementation
was 316.7 and the SAD was 0.92%. Additional results,
using various numbers of landmarks, detail settings and
system configurations, are presented within the paper.
Our experiments showed that nonlinear warping can be
significantly accelerated using consumer 3D graphics
hardware and the results of such an implementation are
of comparable quality to those produced by an accepted
software method.
References
[1] Pluim JP, Maintz JB, Viergever MA. Mutual-information-based
registration of medical images: a survey. IEEE Trans Med Imaging
2003;22(8):986–1004.
[2] Slomka PJ, Dey D, Przetak C, Aladl UE, Baum RP. Automated 3-
dimensional registration of stand-alone (18)F-FDG whole-body PET
with CT. J Nucl Med 2003;44(7):1156–67.
[3] Fei B, Kemper C, Wilson DL. A comparative study of warping and
rigid body registration for the prostate and pelvic MR volumes.
Comput Med Imaging Graph 2003;27(4):267–81.
[4] Bentoutou Y, Taleb N, Chikr El Mezouar M, Taleb M, Jetto L. An
invariant approach for image registration in digital subtraction
angiography. Pattern Recognit 2002;35(12):2853–65.
[5] Mattes D, Haynor DR, Vesselle H, Lewellen TK, Eubank W. PET-CT
image registration in the chest using free-form deformations. IEEE
Trans Med Imaging 2003;22(1):120–8.
[6] Spitzer J. Nvidia OpenGL Performance FAQ. www.nvidia.com; 2002.
Nvidia. 9-12-2003.
[7] ATI. Radeon 9500/9600/9700/9800 OpenGL Programming and
Optimization Guide. www.ati.com; 2004. ATI. 2-2-2004.
[8] Rezk-Salama C, Hastreiter P, Greiner G, Ertl T. Non-linear
registration of pre- and intraoperative volume data based on piecewise
linear transformations. Proceedings of Erlangen workshop of vision,
modelling, and visualization (VMV); University of Erlangen; 1999, p.
365–72.
[9] Rezk-Salama C, Scheuering M, Soza G, Greiner G. Fast volumetric
deformation on general purpose hardware. Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS workshop on graphics hardware. Los
Angeles, CA: ACM Press; 2001 p. 17–24.
[10] Lal Shimpi A. 3DLabs’ P10 Visual processing unit—when a CPU and
GPU collide. www.anandtech.com; 2002. AnandTech. 3-20-2004.
[11] Proudfoot K, Mark WR, Svetoslav T, Hanrahan P. A real-time
procedural shading system for programmable graphics hardware.
Proceedings of the 28th annual conference on computer graphics and
interactive techniques. Los Angeles, CA: ACM Press; 2001 p.
159–70.
[12] Lindholm E, Kilgard MJ, Moreton H. A user-programmable vertex
engine. Proceedings of the 28th annual conference on computer
graphics and interactive techniques. Los Angeles, CA: ACM Press;
2001 p. 159–70.
[13] Microsoft. DirectX Programming Guide. www.microsoft.com; 2003.
Microsoft. 2-13-2004.
[14] Segal M, Akeley K. The OpenGL graphics system: a specification
(Version 1.5). www.opengl.org; 2003. Silicon Graphics Inc.
9-27-2003.
[15] Bookstein FL. Principal warps: thin plate spline and the
decomposition of deformations. IEEE Trans Pattern Anal 1989;11(6):567–85.
[16] Gobbi DG, Peters TM. Generalized 3D nonlinear transformations for
medical imaging: an object-oriented implementation in VTK. Comput
Med Imaging Graph 2003;27(4):255–65.
[17] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical
recipes in C, 2nd ed. Cambridge: Press Syndicate of the University of
Cambridge; 2002 p. 36.
[18] Foley JD, van Dam A, Feiner SK, Hughes JF. Computer graphics:
principles and practice, 2nd ed. New York, NY, USA: Addison-
Wesley; 1997 p. 741.
[19] OpenGL Architecture Review Board (ARB). The openGL reference
manual, 3rd ed. New York: Addison-Wesley; 1999.
[20] Schroeder W, Martin K, Lorensen B. The visualization toolkit: an
object-oriented approach to 3D graphics, 3rd ed. Kitware Inc.; 2004.
[21] Slomka PJ, Hurwitz GA, Clement G, Stephenson J. Three-
dimensional demarcation of perfusion zones corresponding to specific
coronary arteries: application for automated interpretation of
myocardial SPECT. J Nucl Med 1995;36(11):2120–6.
[22] Hoh CK, Dahlbom M, Harris G, Choi Y, Hawkins RA,
Phelps ME, et al. Automated iterative three-dimensional
registration of positron emission tomography images. J Nucl Med 1993;
34(11):2009–18.
[23] Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P.
Multimodality image registration by maximization of mutual
information. IEEE Trans Med Imaging 1997;16(2):187–98.
[24] Slomka PJ. Software approach to merging molecular with anatomic
information. J Nucl Med 2004;45(1):36–45.
[25] Jayakrishnan VK, White PM, Aitken D, Crane P, McMahon AD,
Teasdale EM. Subtraction helical CT angiography of intra- and
extracranial vessels: technical considerations and preliminary
experience. Am J Neuroradiol 2003;24(3):451–5.
[26] Trantakis C, Tittgemeyer M, Schneider JP, Lindner D, Winkler D,
Strauss G, et al. Investigation of time-dependency of intracranial brain
shift and its relation to the extent of tumor removal using intra-
operative MRI. Neurol Res 2003;25(1):9–12.
[27] Justice B. 9700 Pro vs. 9800 Pro. www.hardocp.com; 2003.
HardOCP. 3-30-2004.
[28] Pelletier S. GeForceFX 5200 and 5600. www.hardocp.com; 2003.
HardOCP. 3-30-2004.
David Levin received his BSc in Computer Science and Biology from
the University of Western Ontario, Canada in 2002 and has begun
studying for his MSc in Medical Biophysics at the Robarts Research
Institute. His main interests are interactive 3D graphics, image
processing and analysis.
Damini Dey received her BSc Honours in Physics at the University of
Saskatchewan, Canada in 1988, and her MSc and PhD in Medical
Physics at the University of Calgary, Canada in 1992 and 1998. She did
her Post-Doctoral training at the Imaging Research Laboratories,
Robarts Research Institute, University of Western Ontario, Canada in
image-guided neurosurgery. She is currently a Research Scientist at the
Cedars Sinai Medical Center, Los Angeles, CA. Her research areas are
in-vivo plaque imaging and quantification, and image registration and
fusion.
Piotr J. Slomka received his MASc in Computer Engineering from the
Warsaw University of Technology, Poland, in 1989 and his PhD in
Medical Physics from the University of Western Ontario, Canada in
1995. He is currently a faculty scientist with the Cedars Sinai Medical
Center, Los Angeles, CA and is an Associate Professor of Medicine at
the University of California, Los Angeles. His principal research areas
are image registration, fusion, and automated medical image analysis.