Acceleration of 3D, nonlinear warping using standard video graphics hardware: implementation and initial validation


David Levin (a), Damini Dey (c), Piotr J. Slomka (a, b, c, *)

(a) Department of Medical Biophysics, University of Western Ontario, London, Ont., Canada

(b) Department of Imaging and Medicine, #A047, AIM Program, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA 90048, USA

(c) Department of Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA

Received 19 April 2004; accepted 16 July 2004

Abstract

Thin Plate Spline (TPS) transformations are used in many medical imaging algorithms, but are time-prohibitive for iterative or interactive

use. We have utilized current generation consumer 3D graphics cards to accelerate the application of the TPS nonlinear transformation by

combining hardware accelerated 3D textures, vertex shaders and trilinear interpolation. Our hardware accelerated algorithm warped a 512×512×173 Computed Tomography (CT) dataset in 2.3 s, and the same study, scaled to 256×256×173, in 0.5 s, using a set of 92 landmarks.

An accelerated software implementation of the TPS transformation warped the 512×512×173 study in 32.9 s and the 256×256×173 study

in 9.5 s. Subtracted images were used for qualitative analysis of the warp and the Mutual Information (MI) and Sum of Absolute Difference

(SAD) similarity metrics were used to quantitatively compare the output of the hardware accelerated algorithm to that of the software

algorithm. The hardware accelerated algorithm provided a 7–65 times performance increase when compared to the software algorithm, and

produced output of comparable quality (MI > 300 and SAD < 1.75%).

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Thin plate spline; Nonlinear; Warping; 3D; Hardware Acceleration

1. Introduction

The ability to compare multiple medical imaging scans,

acquired from single or multiple modalities, allows a

physician to review a combination of anatomical and

functional information simultaneously. It also allows them

to estimate the differences between two images more easily.

This can further increase their ability to make an accurate

diagnosis. A major difficulty arises when scans are acquired

at different times. Misalignment of the two images can make

their comparison too inaccurate to be of use. Volume-based

image registration algorithms have been investigated as a

means to correct for these misalignments, making direct

comparisons of two such scans possible [1].

0895-6111/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.compmedimag.2004.07.005

* Corresponding author. Address: Department of Imaging #A047, AIM

Program, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles,

CA 90048, USA. Tel.: +1 310 423 4348; fax: +1 310 412 0173.

E-mail address: [email protected] (P. J. Slomka).

There are two types of image registration algorithm,

linear and nonlinear. Due to the flexible nature of soft tissue,

organs such as the heart and the lungs tend to deform

nonlinearly rather than just be misaligned [2].

Compensating for these deformations requires a nonlinear

approach to registration [2]. The Thin Plate Spline (TPS)

algorithm is a common nonlinear transformation used to

solve such problems. Three dimensional TPS has been

successfully used to register lung studies [2], prostate and

pelvic Magnetic Resonance Imaging (MRI) volumes [3] and

contrast and non-contrast Digital Subtraction Angiography

(DSA) images [4]. Unfortunately, the long runtimes of

current TPS algorithms are prohibitive for iterative or

interactive use [2,3,5].

Although software optimization and approximation can

offer some increase in speed, an alternative avenue for

accelerating nonlinear warping is consumer graphics

hardware. Driven by the increase in popularity and complexity

of interactive video games, several companies have released

computer graphics cards which accelerate many 3D

Computerized Medical Imaging and Graphics 28 (2004) 471–483

www.elsevier.com/locate/compmedimag


operations such as trilinear interpolation and texture

mapping [6,7]. Experiments performed by the Interactive

Visualization Group at the University of Erlangen have

shown that graphics hardware can be used to accelerate

nonlinear warping of a volume set by treating the nonlinear

transformation as a series of piecewise, linear

transformations [8,9]. By exploiting hardware-based 3D texture

mapping and trilinear interpolation, order of magnitude

time enhancements are possible [8].

The newest development in consumer graphics chips, or

Visual Processing Units (VPUs), is the ability to customize

the 3D rendering pipeline [10]. Prior to the current

generation of graphics processors, the rendering pipeline

was a fixed structure, accepting vertices and textures as data

and then acting upon that data using a set of predetermined

operations. Extending the pipeline with custom operations

was impossible. New advances in consumer graphics

hardware, such as shaders, have removed this limitation.

Shaders are custom assembly programs which are executed

as data moves through the rendering pipeline [11,12].

Combinations of shaders can be used to create a custom

rendering pipeline that performs non-standard operations on

vertices and pixels as they move through the rasterization

process. Shaders have recently been introduced to consumer

level graphics cards by ATI and Nvidia [11] and are

supported by the DirectX [13] and OpenGL [14] application

programming interfaces (APIs).

In this work we have exploited the programmability of

this graphics hardware to accelerate a nonlinear warping

algorithm that approximates the TPS transformation. We

extended the previous research in this area [8,9] by

comparing this warping algorithm to an accepted software

implementation of the TPS transformation in terms of speed

and image quality. This was done in order to provide initial

validation for its use in medical imaging applications.

Furthermore, all the presented results were obtained on

inexpensive consumer level graphics cards. Hardware

accelerated warping could be used in image registration

algorithms that require iterative warping of volume data. It

could also be used to allow interactive, nonlinear warping in

a clinical setting.

2. Methods

The hardware accelerated TPS algorithm can be divided

into two distinct sections: the calculation of the TPS

transformation and the application of this transformation to

volume data.

2.1. Algebraic solution to the TPS in three dimensions

The algebraic solution to the TPS transformation in three

dimensions is based on the solution in two dimensions [15].

The TPS transformation requires the definition of an

arbitrary number of landmark pairs. Each pair of landmarks

is comprised of a source landmark and a target landmark.

The mapping of the source landmarks to the target

landmarks is described by the following equation

$$(x_t, y_t, z_t) = f(x_s, y_s, z_s) \tag{1}$$

where

(xt, yt, zt) are the target landmark coordinates in Cartesian

space.

(xs, ys, zs) are the source landmark coordinates in

Cartesian space.

f(x, y, z) is a function that maps the source landmark onto

the target landmark.

The set of linear equations that constitutes the

transformation f(x, y, z) is comprised of five individual matrices.

The matrix P is a matrix of the source landmarks and the

matrix T is a matrix of the target landmarks:

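$$
P = \begin{bmatrix}
1 & x_{s0} & y_{s0} & z_{s0} \\
1 & x_{s1} & y_{s1} & z_{s1} \\
1 & x_{s2} & y_{s2} & z_{s2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{sN} & y_{sN} & z_{sN}
\end{bmatrix},
\qquad
T = \begin{bmatrix}
x_{t0} & y_{t0} & z_{t0} \\
x_{t1} & y_{t1} & z_{t1} \\
x_{t2} & y_{t2} & z_{t2} \\
\vdots & \vdots & \vdots \\
x_{tN} & y_{tN} & z_{tN}
\end{bmatrix}
\tag{2}
$$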

The transpose matrix PT is also required. The matrix K is

defined in terms of the distances between the source

landmarks

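$$
K = \begin{bmatrix}
U(r_{00}) & U(r_{01}) & U(r_{02}) & \cdots & U(r_{0N}) \\
U(r_{10}) & \cdots & \cdots & \cdots & U(r_{1N}) \\
U(r_{20}) & \cdots & \cdots & \cdots & U(r_{2N}) \\
\vdots & \cdots & \cdots & \cdots & \vdots \\
U(r_{N0}) & U(r_{N1}) & U(r_{N2}) & \cdots & U(r_{NN})
\end{bmatrix}
\tag{3}
$$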

where

$r_{ab} = |(x_{sa}, y_{sa}, z_{sa}) - (x_{sb}, y_{sb}, z_{sb})|$ is the Euclidean

distance between two source landmarks.

U(r) is the radial basis function used for the TPS

calculation. In 3D, $U(r) = r$ [16].

Finally, we define a 4×4 zero matrix, O. These matrices

can be combined into a single, larger matrix L:

$$
L = \begin{bmatrix}
K & P \\
P^{T} & O
\end{bmatrix}
\tag{4}
$$

The TPS parameters are computed by evaluating the

following matrix equation:

$$L^{-1} T = V \tag{5}$$

We solved this system of linear equations using Gaussian

Elimination [17]. The resulting matrix V has the following


form:

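$$
V = \begin{bmatrix}
x_{w0} & y_{w0} & z_{w0} \\
x_{w1} & y_{w1} & z_{w1} \\
x_{w2} & y_{w2} & z_{w2} \\
\vdots & \vdots & \vdots \\
x_{wN} & y_{wN} & z_{wN} \\
t_x & t_y & t_z \\
s_{xx} & s_{xy} & s_{xz} \\
s_{yx} & s_{yy} & s_{yz} \\
s_{zx} & s_{zy} & s_{zz}
\end{bmatrix}
\tag{6}
$$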

The first N rows of this matrix specify a set of 3D weights

that are used in the equation f(x, y, z) (Eq. (1)). The bottom

4×3 section of Eq. (6) contains the parameters of a linear

transformation that can be expressed in the following form:

$$
A = \begin{bmatrix}
s_{xx} & s_{yx} & s_{zx} & t_x \\
s_{xy} & s_{yy} & s_{zy} & t_y \\
s_{xz} & s_{yz} & s_{zz} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{7}
$$

Subsequently, we can find the transformed position (x′,

y′, z′) of each point (x, y, z) by evaluating Eq. (1) where:

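$$
f(x, y, z) = A \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
+ \sum_{j=0}^{N} \left( (x_{wj}, y_{wj}, z_{wj}) \, \left| (x_{sj}, y_{sj}, z_{sj}) - (x, y, z) \right| \right)
\tag{8}
$$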
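The steps of Eqs. (2)-(8) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names (`solve`, `fit_tps`, `warp_point`) are hypothetical, the solver is a plain Gauss-Jordan variant of Gaussian elimination with partial pivoting, and the target matrix T is assumed to be padded with four rows of zeros so that its height matches that of L, as is usual for TPS systems.

```python
import math

def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (degenerate landmark set?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    width = len(B[0])
    return [[aug[i][n + k] / aug[i][i] for k in range(width)] for i in range(n)]

def fit_tps(src, dst):
    """Build L (Eq. 4) from the source landmarks and return V (Eq. 6)."""
    n = len(src)
    L = [[0.0] * (n + 4) for _ in range(n + 4)]
    for a in range(n):
        for b in range(n):
            L[a][b] = math.dist(src[a], src[b])  # K with U(r) = r
        row = (1.0,) + tuple(src[a])
        for c in range(4):
            L[a][n + c] = row[c]  # block P
            L[n + c][a] = row[c]  # block P^T
    # Assumption: T is padded with four zero rows to match the height of L.
    rhs = [[float(v) for v in t] for t in dst] + [[0.0] * 3 for _ in range(4)]
    return solve(L, rhs)

def warp_point(p, src, V):
    """Evaluate Eq. (8): linear part plus the weighted radial sum."""
    n = len(src)
    t, sx, sy, sz = V[n:n + 4]
    out = [t[k] + p[0] * sx[k] + p[1] * sy[k] + p[2] * sz[k] for k in range(3)]
    for j in range(n):
        r = math.dist(src[j], p)
        for k in range(3):
            out[k] += V[j][k] * r
    return tuple(out)
```

With at least four non-coplanar, distinct source landmarks, `warp_point(src[i], src, fit_tps(src, dst))` should reproduce `dst[i]` up to rounding error, which is a convenient sanity check for any TPS implementation.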

2.2. TPS warping of volume data

The initial step in applying the TPS transformation to

volume data is to calculate the TPS linear transformation

and the nonlinear weights (Eq. (5)). There are two possible

manners in which the TPS transformation can be applied to

a volume. The first is by evaluating Eq. (8) for each voxel in

the source volume. This will compute its position in the

warped volume. Interpolation (nearest neighbour, trilinear

or tricubic) is used to calculate the intensities of voxels for

arbitrary, new voxel locations in the warped volume.

The second method is known as a grid transform [16]. A

grid transform is represented by a regular, 3D grid of

displacement vectors. Interpolation is used to transform

points that do not lie directly on the grid. This increases the

efficiency of the TPS transformation because Eq. (8) is only

evaluated at each grid point, not at each voxel. However,

the accuracy of the warp then depends on the spacing

of the grid: a finer grid is more accurate but slower

to compute. In this study we used grid spacings ranging from 2×2

voxels to 128×128 voxels. We utilized trilinear

interpolation with both methods of applying the TPS

transformation.
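The grid transform idea can be illustrated with a short Python sketch: a displacement function is sampled only at the nodes of a regular grid, and displacements elsewhere are obtained by trilinear interpolation. The names `build_grid` and `trilinear`, the isotropic `spacing` parameter and the nested-list grid layout are assumptions made for this example; they do not describe the VTK or hardware implementations.

```python
def build_grid(displacement, counts, spacing):
    """Sample displacement(p) at every node of a regular 3D grid."""
    nx, ny, nz = counts
    return [[[displacement((i * spacing, j * spacing, k * spacing))
              for k in range(nz)] for j in range(ny)] for i in range(nx)]

def trilinear(grid, spacing, p):
    """Interpolate the displacement at point p from the 8 surrounding nodes."""
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    idx, frac = [], []
    for c, n in zip(p, dims):
        g = min(max(c / spacing, 0.0), n - 1)  # clamp to the grid extent
        i = min(int(g), n - 2)                 # lower node of the cell
        idx.append(i)
        frac.append(g - i)
    out = [0.0, 0.0, 0.0]
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((frac[0] if di else 1 - frac[0]) *
                     (frac[1] if dj else 1 - frac[1]) *
                     (frac[2] if dk else 1 - frac[2]))
                v = grid[idx[0] + di][idx[1] + dj][idx[2] + dk]
                for k in range(3):
                    out[k] += w * v[k]
    return out
```

In a TPS setting, `displacement` would evaluate Eq. (8) and subtract the input point, so the expensive sum over landmarks is paid once per grid node instead of once per voxel.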

2.3. Hardware acceleration of the TPS

In order to accelerate the TPS transformation using

graphics hardware we have implemented a modified grid

transform algorithm. This algorithm uses the graphics card to

warp 3D texture coordinates at discrete vertices, and uses

trilinear interpolation to determine the intensities of voxels

between the warped points.

A texture is a pattern that is mapped to a polygon via the

process of texture mapping [18]. Common dimensions for

textures are 1D, 2D and 3D. A 1D texture is a line of color.

A 2D texture is an image. A 3D texture is a stack of images

and is similar to a volume of medical imaging data. In order

to map a texture to a polygon one must define texture

coordinates at each vertex of that polygon. Texture

coordinates are interpolated across the face of the polygon,

applying the texture smoothly to the entire surface of the

shape. The texture coordinates of each vertex must be

specified in the same number of dimensions as the texture

itself.

Our implementation utilizes the OpenGL 1.5 API [14]

and was written in C++ using Microsoft Visual

Studio .NET. The hardware accelerated TPS algorithm

requires two OpenGL extensions and one WGL extension

[19]. WGL is the Microsoft Windows, platform-specific

extension to OpenGL. However, since WGL functions have

counterparts on other operating systems our implementation

is not platform specific. This algorithm requires support for

the EXT_texture3D and ARB_vertex_program OpenGL

extensions and the WGL_pixel_format WGL extension.

The EXT_texture3D extension adds 3D texture support to

OpenGL. The ARB_vertex_program extension adds shader

capabilities to OpenGL. This allows the user to replace

OpenGL’s standard, fixed function transformation pipeline

with customized assembly programs. These programs

perform custom operations on each vertex as it is rendered.

In addition, the WGL_pixel_format extension adds Pixel

Buffer functionality to OpenGL. Pixel Buffers are hardware

accelerated, off-screen buffers used to increase the dynamic

texturing speed.

2.3.1. Algorithm details

Initially, the volume data is loaded into the graphics

card’s texture memory as a 3D texture. This is done using the

glTexImage3D() function. OpenGL 3D textures have

several constraints. Firstly, every dimension of a 3D texture

must be a power of two. To accommodate this, each

dimension of the volume is expanded to the nearest power of

two by padding it with zero intensity voxels. Secondly,

current consumer video cards support a maximum of 32-bit

color. This provides an 8-bit range for each of the red, green, blue

and alpha components of a voxel. In order to reduce texture

memory usage, the volume data is loaded as a luminance

texture, requiring only a single 8-bit channel per voxel.

However, most medical data is stored as 16 bits per voxel;

therefore, the data must be scaled to 8 bits (256 intensity

Fig. 1. A diagram of the hardware accelerated TPS algorithm illustrating which sections of the algorithm are executed in hardware and which sections are

executed in software. The algorithm begins by using the source and target landmark sets to calculate the TPS transformation in software. The linear

transformation and the nonlinear weights are passed to an OpenGL vertex program. Using a previously prepared vertex grid, this vertex program applies the

warp on a slice-by-slice basis by rendering each slice of the volume to an OpenGL Pixel Buffer and then copying the warped slice to the final warped 3D texture

for display.


levels) when it is loaded as a texture. Having loaded the

volume data as a texture, a Pixel Buffer with dimensions

equal to the x and y dimensions of the volume is created. In

order to warp a dataset we calculate the TPS transformation;

using Eq. (5), we can calculate the linear transformation for

the TPS, as well as the weights for each landmark.
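The two texture constraints described above (power-of-two dimensions and 8-bit luminance) can be sketched as follows. The helper names are hypothetical, and the linear min-max rescaling is an assumption: the text only states that the data is scaled to 256 intensity levels, not how.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def padded_shape(shape):
    """Texture dimensions after zero-padding each axis to a power of two."""
    return tuple(next_power_of_two(n) for n in shape)

def scale_to_8bit(values):
    """Linearly rescale intensities to 0..255 (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [round(255 * (v - lo) / (hi - lo)) for v in values]

# Example: the 512x512x173 study is padded in z only.
print(padded_shape((512, 512, 173)))
```

For the 512×512×173 study the padded texture is 512×512×256 voxels, which at one byte per voxel occupies 64 MiB of texture memory.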

Next, we create an ordered grid, with arbitrary grid

spacing, for each slice in the volume. For each grid vertex,

the second sum term in Eq. (8) is calculated in software

(Fig. 1). This sum, along with the corresponding grid vertex

and linear TPS transformation (matrix A), is sent to the

OpenGL vertex program (Fig. 1). The program computes

warped texture coordinates for each vertex in a slice, as that

slice is drawn to the Pixel Buffer (Fig. 2A). When a volume

slice is drawn in this manner, the intensities of the pixels

between the warped grid vertices are determined using

hardware-based trilinear interpolation. The warped slice can

then be read back into the OpenGL texture memory using

the glCopyTexSubImage3D() function. This process is

repeated for every slice in the volume to produce a warped

dataset.

Due to the nature of hardware trilinear interpolation we

need to explicitly triangulate our vertex grid [8]. Triangles

are the basic 3D primitives of consumer computer graphics

hardware. If a grid square is drawn using only four vertices,

the graphics card will automatically divide it into two

triangles (Fig. 2B). The nature of this division will lead to

the incorrect interpolation of the pixels inside the square.

We divide each grid square into four separate triangles by

using a central vertex. This is called explicit triangulation

Fig. 2. A diagram illustrating the calculation of new texture coordinates for a grid of vertices (A) and the difference between normal and explicit triangulation (B).

Table 1

A table showing the dimensions and pixel sizes of the two studies used to validate the HW algorithm

              Dimensions (voxels)    Pixel size (mm)
              X     Y     Z          X     Y     Z
Study 1       512   512   173        0.6   0.6   1.25
Study 2       512   512   123        0.4   0.4   1.5
Study 1 (a)   256   256   173        1.2   1.2   1.25
Study 2 (a)   256   256   123        0.8   0.8   1.5

(a) Denotes studies scaled to 256×256 matrix size.


(Fig. 2B) and it is necessary to ensure correct trilinear

interpolation in hardware [8,9].
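Explicit triangulation can be illustrated with a small Python sketch that splits one grid square into four triangles sharing a central vertex. The function name and the 2D tuple representation are assumptions for illustration; in the hardware algorithm each of these vertices would additionally carry a warped 3D texture coordinate.

```python
def explicit_triangles(x0, y0, spacing):
    """Split the grid square at (x0, y0) into four triangles around its centre."""
    x1, y1 = x0 + spacing, y0 + spacing
    centre = (x0 + spacing / 2, y0 + spacing / 2)
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    # One triangle per edge of the square, each closed by the central vertex.
    return [(corners[i], corners[(i + 1) % 4], centre) for i in range(4)]

for tri in explicit_triangles(0, 0, 16):
    print(tri)
```

Because the central vertex is warped like any other grid vertex, each square contributes five warped points instead of four, and the interpolation inside the square no longer depends on which diagonal the graphics card would otherwise choose.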

2.4. Validation of hardware TPS

In order to determine the usefulness of the hardware TPS

algorithm (HW) we compared it to the Visualization Toolkit

(VTK) version 4.2 [20] software based TPS algorithm (SW),

and the VTK software grid transform algorithm (SWG) [16]

in terms of speed and quality. Validation was done using

two test studies (Table 1). Both test studies were standard

DICOM datasets.

The first study was a lung dataset containing two chest

CT volumes. The first volume was recorded during normal

breathing and exhibited significant motion artifacts in the

area of the chest and diaphragm. The second volume was of

identical dimensions and pixel sizes as the first, but was

acquired during inspiration and exhibited no motion

artifacts. The first volume was warped to the second volume

using a set of 92 manually selected landmarks (Fig. 3B).

The second study was a cardiac dataset containing two

CT volumes. The first volume was a CT acquired using an

intravenous contrast agent. The second volume of this study

was a non-contrast CT, taken during a different breath-hold,

with identical pixel sizes and dimensions. A set of 28

landmarks (Fig. 3A) was manually selected in order to warp

the non-contrast CT to the contrast CT. Warped volumes

were created using all three TPS algorithms and compared.

Two different hardware configurations were used for

benchmarking (Table 2). Results for which no system is

specified were obtained using System 1 only.

2.4.1. Quality comparison of hardware and software based TPS

The quality of the HW algorithm was examined in

two ways. First, both test studies were warped using each

TPS algorithm. The HW algorithm and the SWG

algorithm were applied using seven different grid

spacings: 2×2, 4×4, 8×8, 16×16, 32×32, 64×64

and 128×128 voxels. The resulting volumes were

compared to those produced by the SW algorithm

using the Mutual Information (MI) and Sum of Absolute

Difference (SAD) cost functions. Secondly, the warped

volumes were compared to their respective target

volumes using the same two cost functions. In our

study we did not adjust the grid spacing in the z

direction because, for our test studies, the pixel spacing

in z was much larger than the pixel spacing in the x and

y directions (Table 1).

Two additional experiments were performed to

determine the effects of different parameters on warping quality

and speed. First, Study 1 was scaled to 256 intensity values

and then warped using the SW algorithm. The warped study

was compared to the output of the HW algorithm in order to

estimate the effect of intensity scaling on the similarity of

the warped volumes.

Secondly, to assess the effect of matrix size on the quality

and the speed of the HW algorithm, both test studies were

scaled to 256×256 in the x and y directions (Table 1).

Fig. 3. The landmark sets for the cardiac (A) and lung (B) test studies shown in 3D. Light grey (green in web version) indicates source landmarks and dark grey (red in web version) indicates target landmarks.

Table 2

The two benchmarking system configurations used for this paper

           Processor            RAM      Graphics card                             Fill rate (MTexels/s)
System 1   AMD Athlon 1.2 GHz   512 MB   ATI Radeon 9700 Pro, 128 MB DDR RAM       1482.4 (a)
System 2   Intel Xeon 2.0 GHz   1 GB     Nvidia GeForce FX 5600, 256 MB DDR RAM    995.0 (b)

(a) Ref. [27]. (b) Ref. [28].


All three warping algorithms were applied and the speed

and similarity of the results were compared.

2.4.1.1. Sum of absolute difference. SAD provides an

intuitive way to compare two volumes on a voxel by

voxel basis [21,22]. We calculated SAD in the following

manner:

$$
\mathrm{SAD} = \frac{\sum_{1}^{Z} \sum_{1}^{Y} \sum_{1}^{X} \left| V_2(x, y, z) - V_1(x, y, z) \right|}{T(V_1)} \cdot 100\%
\tag{9}
$$

where

X is the x dimension of the volume.

Y is the y dimension of the volume.

Z is the z dimension of the volume.

VA(x, y, z) is the intensity of the voxel at position (x, y, z)

in volume A.

T(VA) is the total intensity count of volume A.

Volumes that are closely aligned exhibit small SAD

values.
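Eq. (9) translates directly into code. The sketch below assumes that both volumes have already been flattened into equal-length sequences of voxel intensities; the function name `sad_percent` is hypothetical.

```python
def sad_percent(v1, v2):
    """Sum of absolute differences as a percentage of the total intensity of v1."""
    if len(v1) != len(v2):
        raise ValueError("volumes must contain the same number of voxels")
    total = sum(v1)  # T(V1) in Eq. (9)
    if total == 0:
        raise ValueError("reference volume has zero total intensity")
    return 100.0 * sum(abs(b - a) for a, b in zip(v1, v2)) / total

# Two tiny four-voxel "volumes" that differ in two voxels.
print(sad_percent([10, 20, 30, 40], [10, 22, 27, 40]))
```

Normalizing by the total intensity of the reference volume makes the value comparable between studies of different size and brightness.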

2.4.1.2. Mutual information. The Mutual Information is a

similarity measure based on information theory [23,24].

The Mutual Information measure has been used in voxel-based

registration to find the transformation that best aligns two

datasets. This is done by maximizing the Mutual

Information of the two volumes. Mutual Information is defined as

$$M(u, v) = \left( H(u) + H(v) - H(u, v) \right) \cdot 100 \tag{10}$$

where

$$H(w) = -\int f(w) \log(f(w)) \, dw \tag{11}$$

f(w) is the probability density function of a random variable

w.

$$H(u, v) = -\iint f(u, v) \log(f(u, v)) \, du \, dv$$

is the joint entropy of u and v.

Fig. 4. The output of the HW algorithm for both test studies. Identical slices are shown for the source, warped and target volumes. Coronal slices are shown for Study 1 and transverse slices are shown for Study 2. The grid spacing for these warps was 16×16 voxels.

2.4.2. Speed comparison of hardware and software based TPS

The performance of each TPS algorithm (SW, SWG, and

HW) was measured as a function of the number of

landmarks used to calculate the TPS transformation. For

the HW algorithm and the SWG algorithm, performance

was also measured as a function of the grid spacing.

In order to determine the effect of landmark quantity on

warping speed, both test studies were warped using each

TPS algorithm. Landmark lists were expanded to contain up

to 165 landmarks by adding new landmarks extrapolated

from the originals. This was done by adding random

displacement factors of between 0.5 and 2.0 mm in each

dimension, to each original landmark. The HW algorithm

and the SWG algorithm were applied with a fixed grid

spacing of 16×16 voxels for this test.

In order to test the effect of grid spacing on the speeds of

the HW and SWG algorithms, each test study was warped

using its corresponding landmark list. A series of warps was

performed using grid spacings ranging from 2×2 voxels to

Table 3

The calculated similarity between volumes warped using the hardware (HW) and software (SWG) algorithms, and their corresponding target volumes

           Hardware            Software
           MI      SAD (%)     MI      SAD (%)
Study 1    174.2   11.8        174.1   11.9
Study 2    52.4    52.7        52.8    52.9


128×128 voxels and the duration of each warp was

recorded.
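The Mutual Information measure of Eqs. (10) and (11) can be approximated for discrete voxel intensities by replacing the probability densities with normalized histograms. The sketch below is an illustration under stated assumptions: intensities are treated as discrete labels with one histogram bin per distinct value, the natural logarithm is used (the equations do not fix the base), and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy (natural log) of a histogram with n samples in total."""
    return -sum((c / n) * math.log(c / n) for c in counts)

def mutual_information(u, v):
    """Discrete estimate of Eq. (10) for two equal-length intensity sequences."""
    if len(u) != len(v):
        raise ValueError("volumes must contain the same number of voxels")
    n = len(u)
    h_u = entropy(Counter(u).values(), n)
    h_v = entropy(Counter(v).values(), n)
    h_uv = entropy(Counter(zip(u, v)).values(), n)  # joint entropy H(u, v)
    return 100.0 * (h_u + h_v - h_uv)

# Identical volumes share all of their information; unrelated ones share none.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

For real image data the intensities would normally be grouped into a fixed number of bins before the histograms are built, which changes the absolute value of the measure but not its use as a relative similarity score.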

3. Results

Fig. 4 shows the results of warping both test studies using

the HW algorithm. A target image is shown for comparison

since landmarks were chosen to produce a warped volume

that approximated the target volume. After warping, the MI

between the warped and target volumes was calculated to be

174.2 for Study 1 and 52.4 for Study 2 (Table 3).

The calculated SAD values were 11.8% for Study 1 and

52.7% for Study 2 (Table 3). The similarity metrics were

also computed between volumes warped using the SWG

algorithm and their respective target volumes. These values

are shown in Table 3 for comparison with the HW values.

When Study 1 was warped using both the SW and the

SWG algorithms, a wedge shaped artifact (Fig. 5) was

observed. In order to prevent this artifact from affecting our

results it was removed by clipping 7 voxels from the borders

of all Study 1 volumes. This was not necessary for the Study

2 volumes because no artifacts were created by the software

algorithms.

Fig. 5. A figure illustrating the artifact created by the software (SW and SWG) algorithms when warping Study 1. The white, dashed box highlights the artifact.

3.1. Quality comparison of hardware and software based TPS

Visually comparing volumes warped using the HW

algorithm to those warped using the SW and SWG

algorithms showed almost identical results, even when

viewed under magnification (Fig. 6). However, subtracted

images illustrate the specific differences between the

outputs of the various algorithms. The HW algorithm

subtracted image shows a low intensity background noise

while the SWG image does not exhibit this effect (Fig. 6).

The similarity between the HW algorithm and the SW

algorithm decreased as the grid spacing of the HW

algorithm was increased (Fig. 7). However, the similarity

between the two warping techniques reached 99% of its

maximum MI value at a grid spacing of 16×16 voxels

(Fig. 7). The correspondence between the SWG algorithm

and the SW algorithm also varied inversely with the grid

spacing of the SWG algorithm. Though the correspondence

continued to increase as the grid spacing was decreased,

95% of its maximum MI was reached at a grid spacing of

16×16 voxels (Fig. 7). Based on comparison with the SW

algorithm, the HW algorithm was slightly more accurate

Fig. 6. Images comparing the results of the SWG algorithm and the HW algorithm to the results of the SW algorithm. The second row of images shows a

magnified view of the area within the white, dashed square. The third row shows images created by subtracting each volume from the SW warped volume. The

fourth row shows a magnified view of the subtracted images.

Fig. 7. The calculated similarity between volumes warped using the

software (SWG) and hardware (HW) algorithms and volumes warped using

the SW algorithm, as a function of grid spacing. The two similarity metrics

computed were Mutual Information (higher is better) and SAD (lower is

better).


than the SWG algorithm at larger grid spacing (Fig. 7).

Table 4 shows the calculated similarities between the HW

and SW algorithms as well as the calculated similarities

between the SWG and SW algorithms.

The similarity between volumes warped using the SWG

and HW algorithms and the target volumes from both test

studies showed little change as grid spacing was increased.

This is evidenced by the flat lines on both the SAD and MI

graphs (Fig. 8).

The effects of intensity scaling on the quality of the HW

algorithm are shown in Table 5. Scaling the source volume

to 8-bit before applying the SW algorithm did not cause the

output of the SW algorithm to match that of the HW

algorithm more closely.

Table 4

The similarity of the software (SWG) and hardware (HW) warped studies when compared to the SW warped studies

           Study 1                            Study 2
           Hardware        Software           Hardware        Software
Matrix     512     256     512     256        512     256     512     256
MI         323.7   324.3   335.9   316.7      312.0   301.9   342.4   316.7
SAD (%)    1.64    1.71    0.22    0.92       1.33    1.72    0.17    0.81

A grid spacing of 16×16 voxels was used for the software and hardware warps. Values for volumes with 512×512 matrix size and volumes with 256×256 matrix size are shown.

Table 5

The similarity of HW warped volumes to 8-bit and 16-bit volumes warped using the SW algorithm

           Study 1            Study 2
           8-bit    16-bit    8-bit    16-bit
MI         322.8    323.7     309.6    312.0
SAD (%)    3.71     1.64      1.51     1.33

Fig. 8. The calculated similarity between volumes warped using the

software (SWG) and the hardware (HW) algorithms and their respective

target volumes as a function of grid spacing. The similarity metrics

computed were Mutual Information (higher is better) and SAD (lower is

better).


The quantitative results of warping the 256×256

volumes with the HW and SWG algorithms are shown in

Table 4. The MI and SAD values calculated between the

results of these algorithms and the results of the SW

algorithm still indicated much greater similarity than the MI and SAD

values calculated between the warped 512×512 volumes

and their corresponding target volumes (Table 3).

Table 6

The warping time, in seconds, of the HW, SWG and SW algorithms for two test studies, on two different sets of hardware

            Study 1                                  Study 2
            HW          SWG          SW              HW          SWG          SW
Matrix      512   256   512    256   512     256     512   256   512    256   512    256
System 1    2.3   0.5   32.9   9.5   292.7   77.2    0.6   0.2   23.5   7.0   77.6   19.5
System 2    2.6   0.7   22.8   5.8   138.2   32.4    1.6   0.4   15.7   4.0   33.5   8.4

Times are provided for volumes with 512×512 matrix size and volumes with 256×256 matrix size. The grid spacing selected for the HW and SWG warps was 16×16 voxels.

3.2. Speed comparison of hardware and software based TPS

The HW algorithm was substantially (between 7 and 65

times) faster than the SWG algorithm (Table 6, Fig. 9).

The speed differential depended on the number of

landmarks in the warp and the system configuration used to run

both algorithms. The HW algorithm was between 85 and 87

times faster than the SW algorithm. The runtimes of each of

the algorithms exhibited a linear dependency on the number

of landmarks; though for both the SWG algorithm and the

HW algorithm this relationship was fairly flat (Fig. 9).

The exact nature of this relationship was determined by

fitting a line to the data. The HW algorithm showed a

computing time increase of 8–12 ms per landmark. The

SWG algorithm showed an increase of 16–19 ms per

landmark. The software algorithms showed improved

performance on System 2 due to the increased processor

speed but the HW algorithm suffered an overall decrease in

performance due to the slower graphics card of System 2.

The fill rate shows the number of textured pixels (texels) the

graphics card can draw per second (Table 2) and is

therefore a fitting indicator of performance in this

application.

The runtimes of both the SWG and HW algorithms

increased exponentially as grid spacing was decreased

(Fig. 10). This effect was more pronounced in the HW

algorithm. The faster processor of System 2 helped alleviate

this effect (Fig. 10).

A higher warping speed was obtained by decreasing the

matrix size of the source volumes (Table 6).

4. Discussion

4.1. Quality comparison of hardware and software TPS

Several interesting observations can be made from the

quality measurements performed on the HW algorithm.

When the HW and the SWG algorithms were compared to

the SW algorithm, it became evident that at lower grid

spacing the SWG algorithm was a slightly closer match to

Fig. 9. Runtimes of all three warping algorithms as a function of landmark

number, when applied to Study 1. The results for Study 2 were

proportionally lower than those for Study 1. The SWG algorithm and the

HW algorithm were applied with a grid spacing of 16×16 voxels. The top

graph provides a detailed comparison of the warping times of the HW

algorithm and the SWG algorithm. The bottom graph includes the SW

runtime.

Fig. 10. The effect of grid spacing on the warping time of the software

(SWG) and hardware (HW) algorithms, when applied to Study 1. The

results for Study 2 were proportionally lower than those for Study 1.


the SW algorithm (Fig. 7). However, as grid spacing was

increased (approximately ≥64×64), the HW algorithm started to better

approximate the SW algorithm (Fig. 7). This is due to the

explicit triangulation of the HW grid. The added midpoint

vertex of each grid square leads to more warped points at

each grid spacing. This explains the higher accuracy of

the HW algorithm at higher grid spacing. At lower grid

spacing this advantage is not as important because the size

of each grid square is decreased and the effect of the

midpoint vertex becomes negligible.

Both the HW and SWG algorithms reached a point at

which decreasing the grid spacing no longer had a large

impact on the MI or the SAD calculations (Fig. 7).

Decreasing the grid spacing below 16×16 voxels showed

a negligible increase in the correspondence between the SW

algorithm and both the HW and the SWG algorithms.

Visual comparison did not reveal any significant

differences between the outputs of the HW algorithm and

the target volumes (Fig. 4). Furthermore, the HW algorithm

produced results that were nearly indistinguishable from

those produced by both the SW and the SWG algorithms

(Fig. 6). The subtracted images revealed subtle differences

in the output of all three warping algorithms, but showed

that they all produced the same overall effect (Fig. 6). The

most prominent visual effect was the low intensity

background noise in the subtracted HW image (Fig. 6).

This effect may be due to differences in the software

and hardware interpolation routines or in the manner in

which the TPS transformation (Eq. (5)) is evaluated. VTK

uses eigenvector decomposition to evaluate Eq. (5), and this

could return slightly different results from our own Gaussian

elimination method.
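For reference, a generic linear solver based on Gaussian elimination can be sketched as follows. This is a textbook version with partial pivoting, not the routine used in this work; whether that routine pivots, and the exact system it solves, are not restated here. The function name `solve_gaussian` is an assumption made for illustration.

```python
def solve_gaussian(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the column entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

solution = solve_gaussian([[2.0, 1.0, -1.0],
                           [-3.0, -1.0, 2.0],
                           [-2.0, 1.0, 2.0]],
                          [8.0, -11.0, -3.0])
```

Two solvers that are both numerically sound can still differ in the last few bits of their results, and such differences may then be amplified slightly by interpolation and intensity rescaling.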

The similarity between the outputs of the HW algorithm and

the SW algorithm, as measured by MI and SAD, though lower

than that between the outputs of the SWG algorithm and the SW

algorithm (Table 4, Fig. 7), was still very high, especially

when compared to the similarity between the warped and

target volumes (Table 3, Fig. 8).

The MI and SAD values calculated between the warped and

target volumes were virtually identical, regardless of

whether the SWG algorithm or the HW algorithm was

used to warp the source volumes. This indicates that the

observed quantitative and qualitative differences in the

output of the HW algorithm are insignificant in real-world

applications.
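The two similarity measures can be sketched for a pair of flattened voxel lists as shown below. This is a minimal illustration under stated assumptions: SAD is normalized here as the mean absolute difference expressed as a percentage of the intensity range, and MI is computed in bits from the joint histogram. The normalization and scaling that produce the values reported in this paper are not reproduced, and both function names are hypothetical.

```python
import math
from collections import Counter

def sad_percent(a, b, intensity_range):
    """Mean absolute difference as a percentage of the intensity range (assumed normalization)."""
    total = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * total / (len(a) * intensity_range)

def mutual_information(a, b):
    """Mutual information in bits, estimated from the joint histogram of a and b."""
    n = len(a)
    joint = Counter(zip(a, b))   # joint histogram of intensity pairs
    hist_a = Counter(a)          # marginal histogram of a
    hist_b = Counter(b)          # marginal histogram of b
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        # pxy / (px * py), written with counts to avoid extra divisions
        ratio = count * n / (hist_a[x] * hist_b[y])
        mi += pxy * math.log2(ratio)
    return mi

volume = [0, 0, 1, 1, 2, 2, 3, 3]
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]
print(mutual_information(volume, volume), mutual_information(volume, unrelated))
```

A lower SAD and a higher MI both indicate closer agreement, so the two measures move in opposite directions as similarity improves.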

4.2. Speed comparison of hardware and software TPS

At grid spacings greater than or equal to 16×16 voxels,

the HW algorithm exhibited a substantial performance

advantage over both software algorithms (Table 6, Fig. 10).

As the grid spacing was decreased, the speed of the HW

algorithm approached that of the SWG algorithm, although the HW

algorithm was always faster for identical studies warped using identical

hardware. This convergence may be due to the explicit triangulation

of the hardware grid. Due to the extra midpoint vertex in


each grid square, the hardware accelerated TPS grid

contains more vertices than the software grid transform at

equivalent grid spacing. If the software grid contains N²

points, then the hardware grid contains (2N² − 2N + 1)

vertices, that is, the N² grid points plus one midpoint for each

of the (N − 1)² grid squares. The computed sum term from Eq. (8) involves

a distance calculation that must be performed for each

landmark in the warp, at each grid point. This calculation is

still performed in software; therefore, adding more grid

vertices causes the software component of the algorithm to

dominate the computing time. Thus, the advantage of

hardware acceleration is overshadowed and processor speed

becomes the bottleneck. A faster processor helps to alleviate

this problem, as shown by the benchmark results for System

2 (Fig. 10).
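The growth of the software workload with grid density can be estimated with a short calculation. The sketch below counts one distance evaluation per vertex and landmark, using the vertex counts given above and the 92 landmarks of the benchmark warp. The mapping from grid spacing to N (grid points along one side of a 512-voxel-wide slice) is an assumption made for illustration, as are the function names.

```python
def software_grid_points(n):
    """Points in an n-by-n software grid."""
    return n * n

def hardware_grid_vertices(n):
    """Corner vertices plus one midpoint vertex per grid square."""
    return n * n + (n - 1) * (n - 1)

def distance_evaluations(vertices, landmarks):
    """One landmark-to-vertex distance per (vertex, landmark) pair."""
    return vertices * landmarks

LANDMARKS = 92  # landmark count of the benchmark warp

for spacing in (64, 32, 16, 8, 4):
    # Assumed relation between spacing and grid size for a 512-wide slice.
    n = 512 // spacing + 1
    sw = distance_evaluations(software_grid_points(n), LANDMARKS)
    hw = distance_evaluations(hardware_grid_vertices(n), LANDMARKS)
    print(f"spacing {spacing:2d}: N={n:3d}  SW grid {sw:9d}  HW grid {hw:9d}")
```

Under these assumptions the hardware grid requires nearly twice as many distance evaluations as the software grid once N is large, and both counts grow quadratically as the spacing is reduced.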

Based on the comparisons of the HW warped volumes to

the SW warped volumes, we can see that decreasing the grid

spacing below 16×16 voxels has little effect on the

accuracy of the transformation (Figs. 7 and 8). This allows

us to apply the HW algorithm using a grid spacing

(≥16×16 voxels) that favors the hardware component of

the algorithm and is therefore substantially faster than the

SWG algorithm (Table 6). This suggests that 16×16 voxels

is the optimal grid spacing for studies with a 512×512

matrix size and with sub-millimeter pixel spacing in the x

and y directions.

All the algorithms showed a linear increase in runtime in

relation to the number of landmarks supplied to the warp

(Fig. 9). The SW algorithm was the slowest and most

affected by the number of landmarks used, while the HW

algorithm was substantially faster than both software

algorithms (Fig. 9). The software algorithms received an

increase in performance from the faster processor of System

2. The HW algorithm also exhibited some dependence on

processor speed. Though the total performance of the HW

algorithm on System 2 was worse than on System 1, the

increase in computing time per landmark was smaller (Fig. 9).

By decreasing the matrix size of the source volumes, a

speed increase was achieved (Table 6). The calculated

values of MI and SAD for these warps (Table 4) were still

much higher than the same values calculated between the

warped and target volumes of both test studies (Table 3)

indicating that a grid spacing of 16×16 voxels is still

acceptable for volumes of this matrix and pixel size.

4.3. Comparison to other research

Previous research on fast volumetric deformation has

focused on either novel algorithms [9] that have not been

tested for use in medical imaging applications, or the

application of hardware acceleration to a specific problem

area such as neurosurgery [8], without evaluation of image

quality. In neither case were the results quantitatively

validated by comparison with more accurate software

algorithms. Specifically, in Ref. [8], only piecewise linear

warping was proposed. In contrast, we utilize the TPS

nonlinear transformation. In addition, those results were

obtained using dedicated, expensive graphics workstations

and not widely available consumer graphics cards. In our

study we have developed a novel, hardware accelerated

implementation of the TPS transformation, which exploits

the latest developments in consumer graphics hardware.

We have validated our approach by using qualitative and

quantitative analysis to demonstrate that our algorithm can

produce results that are nearly identical to those of a

software TPS algorithm. Therefore, our hardware

accelerated algorithm could be used to provide a significant

increase in speed to any algorithm relying on the TPS

transformation or nonlinear warping. One potential

application is to accelerate the compensation for respiratory

motion artifacts in Digital Subtraction Angiography [4].

This idea could be extended for use with CT Angiography

(CTA) [25]. Nonlinear transformations have also been used

to align lung CT acquired during normal respiration with

high quality diagnostic CT acquired during inspiration [2]

and to assess intracranial brain shift [26]. The hardware

accelerated TPS algorithm has the potential to accelerate all

these applications.

Since the TPS transformation requires only a small set of

landmarks to cause a large change in the shape of a volume,

it is well suited for use in iterative registration algorithms.

One could repeatedly translate each landmark until some

similarity metric was optimized (MI maximized or SAD minimized). The hardware

accelerated TPS would benefit this type of algorithm greatly

because each iteration would achieve an order of magnitude

increase in speed [5]. Therefore, more precise algorithms,

which use more iterations, could be evaluated using our

accelerated warping algorithm.
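Such an iterative scheme can be sketched in one dimension. The example below is a toy under stated assumptions, not the registration method of any cited work: the warp is a piecewise-linear displacement between control points (a stand-in for the TPS transformation), the metric is the unnormalized SAD, and the search is a simple greedy step on each landmark offset. All function names are hypothetical.

```python
def interp_displacement(x, positions, offsets):
    """Piecewise-linear displacement between control points (stand-in for TPS)."""
    if x <= positions[0]:
        return offsets[0]
    for i in range(1, len(positions)):
        if x <= positions[i]:
            t = (x - positions[i - 1]) / (positions[i] - positions[i - 1])
            return (1 - t) * offsets[i - 1] + t * offsets[i]
    return offsets[-1]

def sample(signal, x):
    """Linear interpolation of a 1D signal, clamped at both ends."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]

def warp(signal, positions, offsets):
    """Resample the signal at x + displacement(x) for every output sample."""
    return [sample(signal, x + interp_displacement(x, positions, offsets))
            for x in range(len(signal))]

def sad(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def register(source, target, positions, step=1.0, max_iters=100):
    """Greedily move one landmark offset at a time while the SAD decreases."""
    offsets = [0.0] * len(positions)
    best = sad(warp(source, positions, offsets), target)
    for _ in range(max_iters):
        improved = False
        for i in range(len(offsets)):
            for delta in (step, -step):
                trial = list(offsets)
                trial[i] += delta
                score = sad(warp(source, positions, trial), target)  # one warp per trial
                if score < best:
                    offsets, best, improved = trial, score, True
        if not improved:
            break
    return offsets, best

source = [float(i * i) for i in range(21)]
positions = [0.0, 20.0]
target = warp(source, positions, [2.0, 2.0])  # synthetic target with known offsets
found_offsets, final_sad = register(source, target, positions)
```

Every trial move requires a full warp and a metric evaluation, so the cost of the warp dominates the loop; this is the step that a faster warping algorithm would shorten.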

4.4. Limitations

There is an important logistical consideration that arises

when hardware acceleration is used in the implementation

of any algorithm. Any such implementation becomes

dependent on the type of OpenGL graphics card used. If

this algorithm were to be used as part of a software package,

the package would still need to provide fallback software algorithms

for use whenever such hardware was not available, as is the case

for most laptops and lower-end desktop

machines.

Due to hardware limitations, the output of the hardware

accelerated warp contains only 256 discrete values. Though

these values are rescaled to reside between the maximum

and minimum values of the original CT, this rescaling step

introduces a reduction in the intensity resolution. This was

investigated as a cause of the differences between the HW

algorithm and both software algorithms. Our experiments

revealed that this scaling did not significantly affect the

quantitative similarity measures. Scaling the source

volumes and then applying the SW algorithm did not

produce a warped volume that was more similar to the

output of the HW algorithm (Table 5). Furthermore, since

the calculated similarity between the output of the HW


algorithm and the target volumes was nearly identical to

the similarity calculated between the output of the SWG

algorithm and the target volumes (Table 3), one can

conclude that intensity scaling does not significantly affect

the output of the HW algorithm.
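The size of the error introduced by an 8-bit intensity representation can be bounded with a short calculation. The sketch below assumes a hypothetical CT intensity range of roughly -1000 to 3000 and a simple linear mapping onto 256 levels followed by rescaling to the original range; the mapping used by the graphics hardware may differ in detail.

```python
def quantize_to_8bit(values):
    """Map values linearly onto 0..255, then rescale back to the original range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant input: every value maps to level 0
    codes = [round((v - lo) / scale) for v in values]
    restored = [lo + c * scale for c in codes]
    return codes, restored

# Hypothetical intensity samples spanning an assumed CT range.
ct_values = list(range(-1000, 3001, 7))
codes, restored = quantize_to_8bit(ct_values)

step = (max(ct_values) - min(ct_values)) / 255.0
worst_error = max(abs(a - b) for a, b in zip(ct_values, restored))
print(f"level width {step:.2f}, worst-case error {worst_error:.2f}")
```

With linear rescaling the worst-case error is half of one quantization level, which for this assumed range is a small fraction of the full intensity span.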

One final limitation of current consumer graphics

hardware is the lack of hardware accelerated tricubic

filtering. Visual inspection shows that trilinear interpolation

produces acceptable results for image registration, but

future graphics cards may support this feature, removing

this limitation entirely.
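For comparison with the hardware filtering, trilinear interpolation can be written in software as three nested linear interpolations. The sketch below is a plain reference version operating on a nested list indexed as volume[z][y][x]; the indexing convention and the edge clamping are assumptions, and each dimension must contain at least two samples.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def trilinear(volume, x, y, z):
    """Sample volume[z][y][x] at a fractional position by trilinear interpolation."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    # Clamp the position so the 2x2x2 neighbourhood stays inside the volume.
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    x0 = min(int(x), nx - 2)
    y0 = min(int(y), ny - 2)
    z0 = min(int(z), nz - 2)
    tx, ty, tz = x - x0, y - y0, z - z0
    # Interpolate along x on the four edges of the cell.
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x0 + 1], tx)
    c10 = lerp(volume[z0][y0 + 1][x0], volume[z0][y0 + 1][x0 + 1], tx)
    c01 = lerp(volume[z0 + 1][y0][x0], volume[z0 + 1][y0][x0 + 1], tx)
    c11 = lerp(volume[z0 + 1][y0 + 1][x0], volume[z0 + 1][y0 + 1][x0 + 1], tx)
    # Then along y, and finally along z.
    c0 = lerp(c00, c10, ty)
    c1 = lerp(c01, c11, ty)
    return lerp(c0, c1, tz)

# A 2x2x2 test volume whose intensity is linear in position: v = x + 2y + 4z.
cube = [[[x + 2 * y + 4 * z for x in range(2)] for y in range(2)] for z in range(2)]
centre_value = trilinear(cube, 0.5, 0.5, 0.5)
```

Trilinear interpolation reproduces any intensity field that is linear in position exactly; a tricubic filter would additionally use a 4×4×4 neighbourhood to obtain a smoother result.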

5. Conclusion

We have shown that TPS volume warping can be

significantly accelerated using the latest advancements in

consumer graphics hardware. Furthermore, we have shown,

both quantitatively and qualitatively, that the results of this

hardware accelerated TPS algorithm are of comparable

quality to those produced by two different software

algorithms. This generic method will allow dramatic

acceleration of a wide variety of medical imaging

algorithms that utilize 3D image warping. Consequently, it

could be used to implement more accurate image

registration methods that are currently time-prohibitive.

6. Summary

We have exploited the latest consumer 3D graphics

hardware in order to implement a high performance Thin

Plate Spline (TPS) nonlinear volume warping algorithm.

Accelerated nonlinear warping could be used in iterative

image registration algorithms or in interactive clinical

applications. Hardware accelerated 3D texturing and

trilinear interpolation were combined with OpenGL

vertex programs to implement a modified grid transform

algorithm. This algorithm was used to apply the TPS

transformation to volume data using the ATI Radeon

9700 Pro and Nvidia Geforce FX 5600 graphics cards.

The hardware accelerated TPS algorithm was compared

to a public domain software implementation (the

Visualization Toolkit (VTK) version 4.2) in terms of

speed and quality. Mutual Information (MI) and Sum of

Absolute Difference (SAD) were employed, along with

qualitative observation, to gauge the accuracy of the

hardware accelerated TPS algorithm. The hardware

accelerated TPS algorithm was between 85 and 87

times faster than the software TPS algorithm and 7–65

times faster than an optimized software TPS algorithm.

The hardware accelerated TPS algorithm warped a 512×512×173 lung Computed Tomography (CT) dataset in

2.3 s, compared to 32.9 s for the optimized software TPS

algorithm. Ninety-two landmarks were used to compute

the TPS transformation. For this warp, the calculated MI

between the output of the hardware accelerated algorithm

and the output of the VTK software implementation was

323.7 and the SAD was 1.64%. An MI value of 335.9

and an SAD value of 0.22% were calculated between the

outputs of the optimized software algorithm and the VTK

software implementation. The same experiment was

repeated with the dataset scaled to 256×256×173.

This yielded warping times of 0.5 s for the hardware

accelerated algorithm, versus 9.5 s for the optimized

software algorithm. The MI between the output of the

hardware accelerated algorithm and that of the VTK

implementation was 324.3 and the SAD was 1.71%. The

MI calculated between the output of the optimized

software algorithm and that of the VTK implementation

was 316.7 and the SAD was 0.92%. Additional results,

using various numbers of landmarks, detail settings and

system configurations, are presented within the paper.

Our experiments showed that nonlinear warping can be

significantly accelerated using consumer 3D graphics

hardware and the results of such an implementation are

of comparable quality to those produced by an accepted

software method.

References

[1] Pluim JP, Maintz JB, Viergever MA. Mutual-information-based

registration of medical images: a survey. IEEE Trans Med Imaging

2003;22(8):986–1004.

[2] Slomka PJ, Dey D, Przetak C, Aladl UE, Baum RP. Automated 3-

dimensional registration of stand-alone (18)F-FDG whole-body PET

with CT. J Nucl Med 2003;44(7):1156–67.

[3] Fei B, Kemper C, Wilson DL. A comparative study of warping and

rigid body registration for the prostate and pelvic MR volumes.

Comput Med Imaging Graph 2003;27(4):267–81.

[4] Bentoutou Y, Taleb N, Chikr El Mezouar M, Taleb M, Jetto L. An

invariant approach for image registration in digital subtraction

angiography. Pattern Recognit 2002;35(12):2853–65.

[5] Mattes D, Haynor DR, Vesselle H, Lewellen TK, Eubank W. PET-CT

image registration in the chest using free-form deformations. IEEE

Trans Med Imaging 2003;22(1):120–8.

[6] Spitzer J. Nvidia OpenGL Performance FAQ. www.nvidia.com; 2002.

Nvidia. 9-12-2003.

[7] ATI. Radeon 9500/9600/9700/9800 OpenGL Programming and

Optimization Guide. www.ati.com; 2004. ATI. 2-2-2004.

[8] Rezk-Salama C, Hastreiter P, Greiner G, Ertl T. Non-linear

registration of pre- and intraoperative volume data based on piecewise

linear transformations. Proceedings of Erlangen workshop of vision,

modelling, and visualization (VMV); University of Erlangen; 1999, p.

365–72.

[9] Rezk-Salama C, Scheuering M, Soza G, Greiner G. Fast volumetric

deformation on general purpose hardware. Proceedings of the ACM

SIGGRAPH/EUROGRAPHICS workshop on graphics hardware. Los

Angeles, CA: ACM Press; 2001 p. 17–24.

[10] Lal Shimpi A. 3DLabs’ P10 Visual processing unit—when a CPU and

GPU collide. www.anandtech.com; 2002. AnandTech. 3-20-2004.

[11] Proudfoot K, Mark WR, Tzvetkov S, Hanrahan P. A real-time

procedural shading system for programmable graphics hardware.

Proceedings of the 28th annual conference on computer graphics and

interactive techniques. Los Angeles, CA: ACM Press; 2001 p.

159–70.



[12] Lindholm E, Kilgard MJ, Moreton H. A user-programmable vertex

engine. Proceedings of the 28th annual conference on computer

graphics and interactive techniques. Los Angeles, CA: ACM Press;

2001 p. 149–58.

[13] Microsoft. DirectX Programming Guide. www.microsoft.com; 2003.

Microsoft. 2-13-2004.

[14] Segal M, Akeley K. The OpenGL graphics system: a specification

(Version 1.5). www.opengl.org; 2003. Silicon Graphics Inc.

9-27-2003.

[15] Bookstein FL. Principal warps: thin plate spline and the decomposition

of deformations. IEEE Trans Pattern Anal 1989;11(6):567–85.

[16] Gobbi DG, Peters TM. Generalized 3D nonlinear transformations for

medical imaging: an object-oriented implementation in VTK. Comput

Med Imaging Graph 2003;27(4):255–65.

[17] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical

recipes in C, 2nd ed. Cambridge: Press Syndicate of the University of

Cambridge; 2002 p. 36.

[18] Foley JD, van Dam A, Feiner SK, Hughes JF. Computer graphics:

principles and practice, 2nd ed. New York, NY, USA: Addison-

Wesley; 1997 p. 741.

[19] OpenGL Architecture Review Board (ARB). The openGL reference

manual, 3rd ed. New York: Addison-Wesley; 1999.

[20] Schroeder W, Martin K, Lorensen B. The visualization toolkit: an

object-oriented approach to 3D graphics, 3rd ed. Kitware Inc.; 2004.

[21] Slomka PJ, Hurwitz GA, Clement G, Stephenson J. Three-

dimensional demarcation of perfusion zones corresponding to specific

coronary arteries: application for automated interpretation of

myocardial SPECT. J Nucl Med 1995;36(11):2120–6.

[22] Hoh CK, Dahlbom M, Harris G, Choi Y, Hawkins RA,

Phelps ME, et al. Automated iterative three-dimensional registration

of positron emission tomography images. J Nucl Med 1993;

34(11):2009–18.

[23] Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P.

Multimodality image registration by maximization of mutual

information. IEEE Trans Med Imaging 1997;16(2):187–98.

[24] Slomka PJ. Software approach to merging molecular with anatomic

information. J Nucl Med 2004;45(1):36–45.

[25] Jayakrishnan VK, White PM, Aitken D, Crane P, McMahon AD,

Teasdale EM. Subtraction helical CT angiography of intra- and

extracranial vessels: technical considerations and preliminary experience.

Am J Neuroradiol 2003;24(3):451–5.

[26] Trantakis C, Tittgemeyer M, Schneider JP, Lindner D, Winkler D,

Strauss G, et al. Investigation of time-dependency of intracranial brain

shift and its relation to the extent of tumor removal using intra-

operative MRI. Neurol Res 2003;25(1):9–12.

[27] Justice B. 9700 Pro vs. 9800 Pro. www.hardocp.com; 2003.

HardOCP. 3-30-2004.

[28] Pelletier S. GeForceFX 5200 and 5600. www.hardocp.com; 2003.

HardOCP. 3-30-2004.

David Levin received his BSc in Computer Science and Biology from

the University of Western Ontario, Canada in 2002 and has begun

studying for his MSc in Medical Biophysics at the Robarts Research

Institute. His main interests are interactive 3D graphics, image

processing and analysis.

Damini Dey received her BSc Honours in Physics at the University of

Saskatchewan, Canada in 1988, and her MSc and PhD in Medical

Physics at the University of Calgary, Canada in 1992 and 1998. She did

her Post-Doctoral training at the Imaging Research Laboratories,

Robarts Research Institute, University of Western Ontario, Canada in

image-guided neurosurgery. She is currently a Research Scientist at the

Cedars Sinai Medical Center, Los Angeles, CA. Her research areas are

in-vivo plaque imaging and quantification, and image registration and

fusion.

Piotr J. Slomka received his MASc in Computer Engineering from the

Warsaw University of Technology, Poland, in 1989 and his PhD in

Medical Physics from the University of Western Ontario, Canada in

1995. He is currently a faculty scientist with the Cedars Sinai Medical

Center, Los Angeles, CA and is an Associate Professor of Medicine at

the University of California, Los Angeles. His principal research areas

are image registration, fusion, and automated medical image analysis.
