
Diploma Thesis

Hybrid implementation of an

iterative reconstruction algorithm for

Positron Emission Tomography sinograms

on CPUs and GPUs

submitted to University of Applied Sciences Regensburg,

Department of Computer Science and Mathematics

by Sebastian Schaetz

Supervisors

Prof. Dr. Markus Kucera

Dr. Frank Kehren

September 26, 2009


To my parents


Abstract

Positron emission tomography (PET) is a medical imaging modality that allows the detection of tissue with particular properties or the observation of metabolic processes in living organisms over time. It plays a vital role in cancer diagnosis and heart examination and represents a promising method for early diagnosis of dementia. With the advancement of scanner technology as well as an ever increasing demand for higher quality images and higher patient throughput, the image reconstruction system can become a bottleneck of a PET system. To meet this demand for more computing power in the reconstruction system, a novel way to speed up the image reconstruction process for modern PET systems is presented. A common graphics processing device in conjunction with the CUDA framework is used to speed up existing algorithms for the CPU. Projectors for full 3D reconstruction are ported to the CUDA device and new algorithms are implemented where necessary. The main focus of this port is to speed up the calculation and to maintain the numerical accuracy of the result. On the basis of the CPU reconstruction system a new system that uses GPU projectors is implemented. With this modification a two-fold speedup is achieved compared to the highly optimized CPU implementation running on 4 CPU cores. Different optimization methods are explored and applied where suitable. The results calculated by GPU and CPU projectors are numerically identical, allowing them to be interchanged. The system is validated and the algorithms are integrated into the reconstruction system of Siemens Healthcare Molecular Imaging PET scanners.


Contents

1 Introduction
  1.1 Previous work
  1.2 Acknowledgement

2 Positron emission tomography
  2.1 Physics
    2.1.1 Radiation and Radioactive Decay
    2.1.2 Positron Decay
    2.1.3 Positron Annihilation
    2.1.4 Radiation Detection and Scintillation
    2.1.5 Positron Emitter Production
  2.2 Data Acquisition
    2.2.1 Scintillation Detectors
    2.2.2 Detected Events
    2.2.3 2D and 3D Data Acquisition
    2.2.4 Optimization
  2.3 Coordinate System, Data Structures and Compression
    2.3.1 Coordinate System
    2.3.2 Parallel Beam Space and Line of Response Space Sinograms
    2.3.3 Sinogram Compression
    2.3.4 Other Data Structures
  2.4 Image Reconstruction
    2.4.1 Basic Principles
    2.4.2 The Maximum-Likelihood Expectation Maximization Algorithm

3 Hybrid implementation of PET image reconstruction
  3.1 Projectors
    3.1.1 Ray projection through pixel images
    3.1.2 Projector Algorithm
    3.1.3 Implementation
  3.2 Optimization of CPU projector
    3.2.1 Analysis of current Implementation
    3.2.2 Optimization of current Implementation
  3.3 The Compute Unified Device Architecture
    3.3.1 Hardware Architecture
    3.3.2 Programming Model
    3.3.3 Execution Model
    3.3.4 Memory Model and Access
    3.3.5 CUDA Toolchain
    3.3.6 Debugging CUDA code
  3.4 Implementing projectors in CUDA
    3.4.1 Requirements
    3.4.2 Implementation
    3.4.3 Optimization
    3.4.4 Results
    3.4.5 Other algorithms
    3.4.6 Texture Units
  3.5 Debugging and Validation
    3.5.1 Debugging
    3.5.2 Validation
  3.6 Product considerations
    3.6.1 Timing the Tests
    3.6.2 Memory Tests
    3.6.3 Reconstruction Tests
  3.7 Product Integration
    3.7.1 Integration Overview
    3.7.2 Multithreading Implementation
    3.7.3 Parallel CPU and GPU projection
    3.7.4 Hybrid Implementation

4 Conclusion


Chapter 1

Introduction

Positron Emission Tomography (PET) is a tomographic imaging technology which utilizes the positron emission decay of radioisotopes. PET allows the detection of tissue with particular properties or the observation of metabolic processes in living organisms over time.

The subject about to undergo a PET scan is injected with a substance common to its body that is known to accumulate in the region of interest or interact in the metabolic process that is to be observed. The substance, called a tracer, is prepared to include positron emitting nuclei. The nuclei are carefully chosen for an optimal balance between patient safety and imaging effectiveness. After injection the subject's body metabolizes the tracer. The tracer emits positrons at a rate specific to the chosen radioactive nucleus. The subject is placed inside the PET scanner where the radioactive emissions can be measured very precisely. The PET scanner is able to detect single positron emissions. The subject remains in the scanner for a certain amount of time and the tomograph continues to measure positron emissions. After the scan is completed, complex algorithms are applied to the acquired data to calculate 3D images of the measurements. The images then show in which regions of the subject's body the tracer accumulates the most or how it is transported and distributed over time.

A common application for this scanning modality is cancer staging and examination following cancer treatment. Many types of cancer cells have an unusually high metabolism and thus require disproportionate amounts of sugar. To detect these cancer cells, a positron emitter is attached to sugar molecules and injected into the patient's body. Cancerous tissue takes up more of the tracer than healthy tissue. In the resulting images cancer cells show more positron emissions than healthy cells.

The current application range of Positron Emission Tomography can be divided into medical applications and research applications. In the area of research PET is used in brain studies to explore human brain functions. In pre-clinical studies (before testing in humans) specific animal PET devices are very common. PET technology is particularly valuable in cancer research with animals as it allows repeated probing of the same subject without killing it, thus substantially reducing the number of animals required.

The three important areas in clinical usage are oncology (cancer diagnosis and treatment), cardiology (treatment of heart disorders) and neurology (treatment of disorders of the nervous system). PET applications for cancer diagnosis and treatment include the detection of tumors, the grading of malignancy by studying the uptake of metabolic tracers, the documentation of how widespread the cancer is in a patient, the testing for returning cancer after treatment and the evaluation of effectiveness of cancer therapy. In cardiology the applications include the preliminary examination of patients before heart transplantation and the diagnosis and assessment of severity of coronary artery disease. In neurology PET is for example used for the management of brain tumors. Furthermore, PET has been shown to be a superior method for early diagnosis of dementias [16]. This could become a very important application as soon as effective treatments are available.

PET system manufacturers try to increase the performance of their systems with every generation of new scanners. The performance of PET systems for clinical use can in general be measured by the quality of the generated images and the number of patients that can be scanned in a given time frame. The image quality is optimized by using better, more sensitive scanners capable of higher resolutions that are able to measure the emitted positrons in a very exact way. This results in newer scanners producing data sets that are multiple times larger than data from earlier scanner generations. An example for this is the switch from 2D data acquisition to 3D. Additionally, physicists are modeling the data acquisition process in more detail than ever, allowing for very powerful image reconstruction techniques to remove noise and compensate for inaccurate measurement. Examples for this are attenuation correction, where the densities of the patient's body are taken into account during image calculation, and PSF reconstruction, where the point spread function of a scanner is considered.

The measured datasets get larger and larger and the image calculation or reconstruction process gets more complex and accurate. At the same time newer scanners are expected to have a higher patient throughput. This is an important factor for clinical use as it determines how many patients can undergo the possibly life-saving diagnosis. The number of patients scanned in a given time frame depends on a lot of factors including scanner sensitivity, the tracer and the clinical workflow.

For image reconstruction the workflow dictates one rule-of-thumb standard: as soon as one scan is completed, the image of the last scan should be ready and available. And this should be true not only for scans from different patients, where there might be a longer time period between scans, but also for whole-body scans. During a whole-body scan one part of the body is scanned for a period of time and then the patient is moved so that another part of his body can be scanned. Assuming a scanning duration of 5 minutes for one part of the body, the image reconstruction system has at most 5 minutes to calculate an image.

This requirement, in conjunction with the growing data sizes and the more complex reconstruction techniques, increases the demand for better performance of the reconstruction system. Current implementations are highly optimized but reach the computational limits of modern high-performance computer systems. It is therefore sensible to look for new ways to speed up the reconstruction process.

The enormous computational performance of graphics devices for gaming or demanding workstation tasks has long been known, and for a while now academia and industry have tried to leverage the potential of these devices for general purpose computation. With the release of several tools that simplify the process of porting algorithms to the GPU, such as BrookGPU, CUDA and CTM (now Stream SDK), in the last 2 years general purpose computation on graphics processing units has experienced a wide-spread adoption in research communities as well as industry.

In this work existing algorithms for PET image reconstruction are ported to graphics processing units using the CUDA framework. They are modified to fit the programming paradigms of GPGPU and new algorithms are implemented where necessary. A number of optimization techniques are investigated and applied to the implementation. The algorithms are tested for performance and accuracy, as the goal is to speed up the reconstruction while the calculated results must not differ from the original implementation. After successful implementation, optimization and validation the algorithms are integrated into the reconstruction system of Siemens Medical PET scanners.

1.1 Previous work

Over the last years a substantial amount of research has gone into speeding up image reconstruction for positron emission tomography as well as other imaging modalities such as single photon emission computed tomography (SPECT) and computed tomography (CT). Scientists and developers came up with many ideas to keep up with the ever increasing demand for image quality and reconstruction speed.

In the 1990s researchers tried speeding up reconstruction using powerful processing clusters such as the Intel iPSC/2 or the BBN Butterfly [11]. Atkins et al. used a Transputer with T800 processors in a master-worker architecture [3] and Jones et al. developed dedicated VLSI hardware for image reconstruction [25]. In 1995 the successor of the Intel iPSC/2, the Intel iPSC/860, a cluster with 128 nodes in a hypercube topology, was used by Johnson et al. [23].

One important improvement of PET image reconstruction was the invention of the OSEM algorithm by Hudson and Larkin [18], which increased the performance of iterative image reconstruction algorithms by an order of magnitude and thus permitted the use of this superior reconstruction method in clinical environments for the first time. Based on this algorithm many researchers achieved even greater speedups by using different hardware.

A lot of research focused on the use of parallel processors and larger computational clusters. In 2001 Vollmar et al. presented impressive results with an implementation running on a 7 node, 4 processors per node Intel Xeon system [52]. In 2002 Jones et al. [24] used a single program multiple data implementation of the OSEM algorithm running on a cluster of 9 nodes, each containing two Intel Xeon processors interlinked via Gigabit Ethernet. The group took two different approaches towards decomposing the problem domain: projection space decomposition and image space decomposition. The image space decomposition idea is very similar to the CPU implementation enhancements presented in this work. The main problem with these approaches is the costly computational cluster that is required for reconstruction. Due to their size and infrastructure requirements they are often not suitable for use in clinical environments. Additionally, in today's environment-aware economy the amount of power the reconstruction system requires is also a factor.

Nevertheless, in 2006 Jones et al. published interesting results achieved with a parallel OSEM implementation using both message passing (MPI) and shared memory programming (OpenMP). Their algorithm spread all necessary projections of one subset not only across multiple computational nodes but also across processor cores inside one node. With this technique the communication overhead between nodes decreased as multiple cores could share memory, which resulted in an adequate speedup. However, the communication overhead was still high, as the algorithm required all nodes to synchronize and exchange all their datasets after the calculation of one subset to calculate the correction factor and finally the image, which is used for the next subset projection. They achieved a parallel efficiency of about 0.5 when utilizing 64 processors and were thus able to speed up reconstruction time by a factor of 30.

Although the components in such a cluster can be cheap commercial off-the-shelf parts, the cost of the reconstruction system increases with the number of processors, especially considering not only the processors but also the node interconnects. In addition, a computational cluster with 64 or more nodes consumes a lot more power than a normal system would and requires a lot more space and possibly special cooling. Also, the mean time between failures is much lower in a distributed system, as it contains more components and thus has a higher failure rate. Thus such large computer systems are often not practical for products designed for clinical use. For clinical applications the challenge is to find a compromise between cost and performance of the reconstruction system.

In search of better solutions researchers came up with the idea of using graphics processing units for image reconstruction. One of the first works yielding impressive results early on was presented in 1994 by Cabral et al. [9]. They showed volume rendering and tomographic reconstruction algorithms for computed tomography utilizing an SGI Onyx Reality Engine processor. They implemented the filtered backprojection algorithm for CT and a volume rendering algorithm, which they found to be very similar in nature, on the graphics processing unit.

About 10 years later, in 2005, Wang et al. [54] presented an OSEM implementation for quantitative single photon emission tomography (SPECT) for graphics cards, including attenuation correction and adjustments for the characteristic point spread functions of the scanner. Using a Nvidia GeforceFX 5900 graphics card with 256MB video memory they achieved an impressive 10 fold speedup. Their work represents a very cost effective solution for quantitative SPECT in clinical environments.

In 2006 Bai et al. [4] presented image reconstruction for PET using graphics hardware. They used hardware very similar to the GPUs used in this work; however, it was not yet based on the G80 core, and at that time the CUDA framework was not available yet. This constrained the developers to a certain extent, as random writes to graphics memory were not possible and the OpenGL API and Cg had to be used. They utilized the bilinear interpolation routines of the hardware texturing unit and were able to calculate 4 LORs at a time by utilizing the RGBA values of a pixel. Although their system could not outperform a highly optimized CPU implementation, the GPU represented a cheaper alternative to CPUs.

In 2007 Panin and Kehren filed a patent [43] for an acceleration of Joseph's method for full 3D reconstruction. In the patent a method is described to speed up the calculation of projection rays through pixel images by calculating the line integral with linear interpolation. The speedup achieved with the method described in the patent results from reusing intermediary results that are already calculated for the oblique sinogram segments. The projector algorithms presented in this work are based upon this method.

In 2007 and 2008 two groups did work similar to [43]: they took another close look at the OSEM reconstruction algorithm and in particular at the projectors and found that they can be implemented more efficiently. Kadrmas [27] found that the projection can be formulated as two separate operations, namely image rotation and image slanting. Kadrmas also found that by using these operations, redundant parts of the projection algorithm can be omitted. This greatly improved the performance of the algorithm. Additionally, depth compression along the z axis was applied to further reduce the computational complexity. Also the idea to reshuffle the indices to efficiently access the memory was explained.

Similarly, Hong et al. [17] also presented an algorithm for the CPU that exploits symmetries in the projections. They achieve a 70 fold speed-up when comparing their implementation to an older reconstruction system by Siemens with 8 computational nodes while maintaining good numerical accuracy. Amongst other things they exploited SSE instructions of modern Intel CPUs to calculate 4 symmetric elements in one step.

Finally, in 2007 Scherl et al. [47] presented a fast algorithm for the reconstruction of computed tomography sinograms on the GPU which used the CUDA architecture. Although the OSEM algorithm is iterative, CT reconstruction and PET reconstruction share the same basic operation: projection. Scherl et al. were able to achieve impressive results utilizing the same hardware that was used in this thesis. They however used the texturing unit of the device to calculate the bilinear interpolations. The group compared their work to an implementation on a CELL processor and ascertained similar performance.

1.2 Acknowledgement

The author thanks his supervisors Markus Kucera, for guiding him through the writing process of this thesis, and Frank Kehren, for his patience, excellent advice and for giving him the opportunity to work on such an interesting topic. Above that the author would like to thank the entire reconstruction group from Siemens Healthcare Molecular Imaging, especially Zigang Wang, Jicun Hu and Chuanyu Zhou, as well as Swetha Nandyala, Keith Clark and Bernd Schaarschmidt. The author very much enjoyed working with the group and is grateful for their ideas and suggestions as well as their support. Special acknowledgment is due to Herbert Kopp and Manfred Kraus who made this thesis possible in the first place.


Chapter 2

Positron emission tomography

2.1 Physics

In the following chapter the physical principles behind PET are explained. Three physical phenomena lay the foundation for PET and are of great importance: positron decay, positron annihilation and scintillation. The following explanations are based on the Rutherford-Bohr model ([6] and [46]) of the atom as a theoretical model. It makes the following assumptions: an atom consists of a nucleus and electrons. The nucleus contains Z protons and N neutrons and their sum is the atom's mass number A = Z + N. Protons contribute the positive charge to the atom. Electrons are positioned in energy levels or shells that surround the nucleus and contribute the negative charge. The common nomenclature used to denote the configuration of a specific nucleus is $^{A}_{Z}X$.

2.1.1 Radiation and Radioactive Decay

Radiation in general is energy in the form of particles or waves. Depending on its effect on matter it can be divided into ionizing and non-ionizing radiation. Examples of non-ionizing radiation are light or radio waves. Of particular interest for nuclear medicine and radiological imaging is the energetic ionizing radiation. Ionizing radiation is emitted during radioactive decay, a process by which an unstable parent nucleus decays into a more stable descendant nucleus [45] by emitting radiation, ionizing particles or both. The rate of decay of N atoms can be statistically approximated by the first order differential equation for exponential decay:

\frac{dN}{dt} = -\lambda N \qquad (2.1)

The solution of 2.1 describes the number of unstable atoms that are left after a certain amount of time t, with N_0 denoting the number of atoms at time t = 0 and \lambda the decay constant:

N(t) = N_0 e^{-\lambda t} \qquad (2.2)

The half-life t_{1/2} = \frac{\ln 2}{\lambda} characterizes the decay rate of unstable atoms by specifying the time required for half of the unstable atoms to decay. Since decay events happen autonomously and continuously, the process of radioactive decay is very similar to a Poisson process. Therefore a Poisson distribution can be assumed when creating a mathematical model for nuclear decay:

P(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \qquad (2.3)

Equation 2.3 describes the probability that within a specific time interval k decay events occur, where the expected number of events during this time interval is \lambda. For PET imaging a specific type of radioactive decay is of particular interest: positron decay.
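To make these relations concrete, the following C++ sketch evaluates equations 2.1 to 2.3 numerically. It is only an illustration; the function names and the example values (the half-life of $^{18}$F from Table 2.1) are chosen here for demonstration and are not part of the reconstruction software.

#include <cmath>
#include <cstdio>

// Decay constant derived from the half-life: lambda = ln(2) / t_half
double decay_constant(double t_half) {
    return std::log(2.0) / t_half;
}

// Fraction of unstable nuclei remaining after time t (equation 2.2)
double remaining(double n0, double lambda, double t) {
    return n0 * std::exp(-lambda * t);
}

// Poisson probability of observing k decay events when lambda are expected (equation 2.3)
double poisson(unsigned k, double lambda) {
    double p = std::exp(-lambda);
    for (unsigned i = 1; i <= k; ++i)
        p *= lambda / static_cast<double>(i);  // accumulates lambda^k / k!
    return p;
}

int main() {
    const double t_half = 109.8;  // half-life of 18F in minutes (Table 2.1)
    const double lambda = decay_constant(t_half);
    std::printf("fraction left after 30 min: %.3f\n", remaining(1.0, lambda, 30.0));
    std::printf("P(k = 3 | lambda = 2.5)   : %.4f\n", poisson(3, 2.5));
    return 0;
}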

2.1.2 Positron Decay

Positron decay happens in proton rich, unstable atoms. They achieve stability by converting a proton to a neutron, whereas the positive charge is emitted from the nucleus in the form of a positron as in:

{}^{1}_{1}p \rightarrow {}^{1}_{0}n + {}^{0}_{1}\beta^{+} + \nu \qquad (2.4)

\nu is a neutrino that accompanies the positron. Positron decay is in some sense the opposite of \beta-decay where an electron is emitted. The general equation for positron decay from an atom is:

{}^{A}_{Z}X \rightarrow {}^{A}_{Z-1}Y + {}^{0}_{1}\beta^{+} + \nu \qquad (2.5)


To balance the atom's charge after decay, the resulting nucleus has to eject one of its orbital electrons. A process called internal conversion is often responsible for this. The electron is ejected with kinetic energy specific to the atom. The emitted positron also has kinetic energy and both electron and positron therefore travel a finite distance outside the nucleus. The range depends on their kinetic energy and the medium they travel in.

Nuclide   t_1/2 (mins)   E_max (MeV)   E_mode (MeV)   Max range in H2O (mm)   Mean range in H2O (mm)
11C       20.4           0.959         0.326          4.2                     1.1
13N       9.96           1.197         0.432          5.1                     1.5
15O       2.03           1.738         0.696          7.3                     2.5
18F       109.8          0.633         0.202          2.4                     0.6
68Ga      68.3           1.898         0.783          8.2                     2.9
82Rb      1.25           3.40          1.385          14.1                    5.9

Table 2.1: Properties of positron-emitting atoms (reproduced from [5])

Table 2.1 shows the properties of chosen positron emitting nuclei. The properties include the half-life t_{1/2} of the nucleus, the maximum and most likely energy of the emitted positrons E_max and E_mode and the maximum and mean range in water. The energies are roughly proportional to the distances a positron travels. These properties are important when selecting a suitable nuclide for PET. The location of the positron decay event has to be measured as exactly as possible, but the further away the positron travels from the nucleus, the greater the uncertainty about where the event actually occurred.

As a result of emitting a positron and an electron, the atom is at least 2 electron masses lighter than before. There's also a chance that a proton rich atom does not achieve stability by positron decay but by capturing an electron.

2.1.3 Positron Annihilation

After the positron is emitted from the nucleus it has kinetic energy, so it travels a certain distance away from the nucleus until its kinetic energy is close to zero. When the positron then collides with an electron the two particles annihilate and leave electromagnetic radiation. The most probable form of radiation that is emitted as a result of this collision is two photons of 511 keV each. With high probability two and with less than 1% probability three photons are emitted. If the kinetic energy of the colliding particles is zero the photons are emitted at 180° to each other. In many cases the momentum of the two particles is not precisely zero before annihilation, so photon pairs are not always emitted strictly at 180°.

Figure 2.1: Decay and Annihilation exemplified by $^{18}_{9}$F

Figure 2.1 illustrates positron decay and annihilation exemplified by the fluorine isotope $^{18}_{9}$F, one of the most used nuclei for PET today.

2.1.4 Radiation Detection and Scintillation

A couple of different techniques exist to detect radiation: ionization chambers, Geiger-Mueller tubes, semiconductor radiation detectors and liquid and solid scintillation detectors. Although there has been research towards semiconductor detectors for medical imaging using the Schottky CdTe detector diode [14] and CdZnTe detectors [15], solid scintillation detectors are the predominant technology in PET. For SPECT however, the usage of semiconductor radiation detectors is on the rise.

A solid scintillator is a transparent crystal that has the ability to emit light when being hit by γ rays. For this principle the assumed model of the atom has to be expanded. Electron energy states are limited to discrete energy levels [50]. A valence band is the highest energy band of a discrete energy level of an atom with electrons present. The following first unfilled band is called the conduction band. The energy gap E_g between valence and conduction band is a few electron volts large [5]. The electrons in the highest energy band can absorb energy from hitting γ rays and get excited. They move into the conduction band and, while releasing scintillation photons, they de-excite instantly. The emitted light is usually in the ultraviolet spectrum; however, by incorporating impurities into the scintillation crystal the band structure can be manipulated so that the crystals emit visible light. This light can be captured by photomultiplier tubes that are thus able to convert the measured photons into an electric signal. This process is also called luminescence. The design of scintillation detectors is described in 2.2.1.

Property                 NaI(Tl)   BGO    LSO    YSO    GSO    BaF2
Density ρ (g/cm3)        3.67      7.13   7.4    4.53   6.71   4.89
Effective Z (Z_eff)      50.5      74.2   65.5   34.2   58.6   52.2
Decay constant (ns)      230       300    40     70     60     0.6
Output (photons/keV)     38        6      29     46     10     2
ΔE/E (%)                 5.8       3.1    9.1    7.5    4.6    4.3

Table 2.2: Properties of commonly used scintillators (reproduced from [5])

Three properties of the scintillator influence the quality of the measurement and thus affect system design and achievable image quality. The scintillator crystal has to be able to effectively stop the 511 keV photons, because if a photon does not deposit its energy into the crystal the γ ray is not counted. The density ρ of the scintillator and the effective atomic number Z_eff determine how effective a crystal is in stopping 511 keV photons. Secondly, the signal decay time is important. Once a photon hits the scintillator, the event should be detected and processed immediately so that the detector is ready to measure new events as they hit. This is quantified by the decay constant. The light output of the scintillator is a third important property. The higher the light output, the less sensitive the photo detectors have to be, resulting in a high ratio of number of crystals to number of photo detectors (see 2.2.1). The light output also determines the energy resolution ΔE/E which affects the detector's ability to recognize and reject scattered events. Scattered events contribute noise and originate from photons that are deflected and lose some of their energy. Table 2.2 lists some commonly used scintillators in PET and their properties. Vendors use different types of scintillators; however, the most common today are LSO (Siemens), a combination of LSO and YSO [29] (Philips) and BGO [49] (General Electric).

2.1.5 Positron Emitter Production

The positron emitting isotopes are created with the help of a cyclotron. The halogens (elements from group 17 of the periodic table) are particularly suitable positron emitters in PET imaging, especially fluorine. The unstable halogens are produced with the help of a cyclotron, a form of particle accelerator that is capable of producing fast protons. In the case of fluorine-18 ($^{18}$F) production, oxygen-18 ($^{18}$O) in highly enriched water is bombarded with very fast protons produced in a cyclotron.

Figure 2.2: Functional diagram of a cyclotron (ion source at the center S, magnetic field B, electrodes D1 and D2, high frequency oscillator, emerging high speed protons)

Figure 2.2 illustrates the functional principle of a cyclotron. Protons are emitted from the center S of the system and are accelerated by the difference of potential in the gap between electrodes D1 and D2. The difference in potential is generated with a high frequency oscillator connected to the two electrodes. Particles experience acceleration only in the gap and not inside the electrodes. A magnetic field B is applied in orthogonal direction to the electrodes and forces the protons onto a circular trajectory. Every time they reach the gap between D1 and D2 the polarity of the electrodes has changed and the protons are thus accelerated again. So the protons circulate inside the system and get faster and faster and their trajectory gets larger and larger until they are fast enough to leave the cyclotron. They then hit oxygen-18 to produce a neutron and fluorine-18 [5].

Figure 2.3: Chemical structure of glucose compared to fluorodeoxyglucose

The fluorine-18 is then used to mark sugar molecules and produce fluorodeoxyglucose. Figure 2.3 shows the chemical structure of a regular sugar molecule and the structure of a sugar molecule with the positron emitter $^{18}$F attached.

Many medically relevant metabolic processes consume sugar. If sugar marked with radioactive nuclei is introduced into those processes, the sugar is consumed. Thus metabolic processes that use a lot of sugar can be made visible with PET scanners. Cancerous tissue often has a disproportionately high growth rate and thus uses a lot of sugar as its energy source. Hence it is possible to detect many different kinds of tumors with PET.

2.2 Data Acquisition

The following chapter discusses how the data is acquired within a PET system. The design of scintillation detectors is sketched and two different acquisition modes are explained, namely 2D and 3D measurement. Additionally, the challenges of the acquisition process are listed. Furthermore, data compression principles are explained because they are an integral step of the reconstruction process.


2.2.1 Scintillation Detectors

One key to accurate data acquisition is to determine the precise location of the gamma ray hitting the detector. The solution lies in the scintillation detector design. The idea is to have many detectors that are as small as possible to get high resolution. One problem arising when trying to use many detectors is the size and the number of photomultipliers that are used to detect the photons emitted while scintillation occurs. Their mode of operation is as follows: they are able to amplify the signal from the crystals and give off a short current pulse indicating that a gamma ray was detected.

The smaller and closer together the scintillation crystals are, the better the detection efficiency and the sampling frequency. The problem is that using a phototube for each scintillation crystal is not practical. Therefore a method devised by Casey and Nutt [10] is used in modern PET scanners to reduce the number of photomultipliers and at the same time improve the spatial resolution of the detection system. The idea is to group a block of scintillation crystals together, for example 8 ∗ 8, and use 4 photomultipliers to measure their light output. The 4 photomultipliers are able to determine which one of the 8 ∗ 8 crystals was hit with the help of a lightguide that is located between the crystals and the photomultipliers.

When a gamma ray hits a crystal the energy is converted to light. The light is transmitted via the lightguide to all the phototubes. Depending on their combined measurements they are able to determine which crystal was hit. The x and y coordinate of the hit crystal within one detector block can be calculated with

x = \frac{(b + d) - (a + c)}{a + b + c + d} \quad \text{and} \quad y = \frac{(a + b) - (c + d)}{a + b + c + d} \qquad (2.6)

with a, b, c and d representing the light output of each of the four photomultipliers. Figure 2.4 shows the schematic layout of one detector block. The 8 ∗ 8 crystals, the lightguide and the four circular photomultipliers are shown.

Figure 2.4: Schematic layout of a scintillation detector block (crystal block, lightguide and phototubes a, b, c, d)
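As a minimal illustration of equation 2.6, the following C++ sketch computes the normalized hit position inside a block from the four photomultiplier readings and maps it to a crystal index. The function names and the simple index mapping are hypothetical and only meant to clarify the principle.

#include <utility>

// Normalized hit position inside a detector block computed from the light
// output a, b, c, d of the four photomultipliers (equation 2.6).
// Both coordinates lie in the range [-1, 1].
std::pair<double, double> blockPosition(double a, double b, double c, double d) {
    const double sum = a + b + c + d;
    return { ((b + d) - (a + c)) / sum,
             ((a + b) - (c + d)) / sum };
}

// Hypothetical mapping of one normalized coordinate onto a crystal index 0..7
// of the 8 * 8 block.
int toCrystalIndex(double coord) {
    int idx = static_cast<int>((coord + 1.0) * 0.5 * 8.0);
    if (idx < 0) idx = 0;
    if (idx > 7) idx = 7;
    return idx;
}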

2.2.2 Detected Events

Whether an event is detected and counted as valid depends on three factors. The two gamma rays have to hit the detectors within a specific time window for the system to recognize the rays as belonging together and forming an event. This is known as the coincidence window. Each of the gamma rays has to deposit a specific amount of energy into the scintillation crystals that is within specified boundaries. Lastly, the ray formed by the two gamma rays has to be an accepted line of response (LOR), that is, a trajectory that the system deems relevant. If an event conforms to all three criteria it is called a prompt event or prompt. However, false positives are possible due to various reasons.
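The three acceptance criteria can be summarized in a small check like the following C++ sketch. The data structure, the parameter names and the callback for the accepted-LOR test are illustrative assumptions, not the actual coincidence processing of a scanner.

// One detected single (one gamma ray hitting one crystal).
struct SingleEvent {
    double time_ns;     // detection time in nanoseconds
    double energy_keV;  // energy deposited in the crystal
    int crystal_id;     // index of the crystal that was hit
};

// Returns true if two singles form a prompt: they lie within the coincidence
// time window, both deposit an energy inside the accepted window around
// 511 keV, and the resulting line of response is accepted by the system.
bool isPrompt(const SingleEvent& e1, const SingleEvent& e2,
              double coincidenceWindow_ns,
              double energyMin_keV, double energyMax_keV,
              bool (*isAcceptedLor)(int crystalA, int crystalB)) {
    const double dt = (e1.time_ns > e2.time_ns) ? e1.time_ns - e2.time_ns
                                                : e2.time_ns - e1.time_ns;
    if (dt > coincidenceWindow_ns) return false;
    if (e1.energy_keV < energyMin_keV || e1.energy_keV > energyMax_keV) return false;
    if (e2.energy_keV < energyMin_keV || e2.energy_keV > energyMax_keV) return false;
    return isAcceptedLor(e1.crystal_id, e2.crystal_id);
}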

Figure 2.5: True (left) and scattered (right) positron events

Figure 2.5 shows true and scattered positron events as they are detected by the PET scanner. In the left image a true coincidence is shown that is detected correctly. The two gamma rays reach the detector ring on opposite sides, form a valid LOR and are detected within the energy and time window. The image on the right shows a scattered event. One gamma ray is Compton scattered [12] inside the subject. The detector system does not recognize the scattering. Scattered rays can be well inside the energy window of the detectors and thus the system cannot distinguish between scattered events and prompts. As a result a LOR between the two detectors is assumed which does not correspond to the actual location of the positron decay.

Figure 2.6: Random (left) and multiple (right) coincidences

Figure 2.6 shows a random event on the left side, where two annihilations take place at almost the same time. The detection system recognizes two gamma rays that did not emit from the same annihilation as a correlated coincidence, as they are detected within the coincidence time window. As a result a wrong LOR is assumed and a wrong count is registered. The right image shows almost the same event, however three photons are detected within the time window. The system cannot determine the correct line of response, so the counted photons are discarded and the events are lost.

Figure 2.7: Uncertain (left) and attenuated (right) positron events

Figure 2.7 finally shows an uncertain event on the left side. At the edge of the field of view emitted photons may hit more detector crystals by passing through them. They only streak the first detector and deposit some energy but are not stopped. Thus they are not counted in the detector crystal they hit first; as a result the photons are detected in the wrong crystal and a wrong line of response is assumed and counted. This is called the parallax effect. Finally, the image on the right shows an attenuated photon that is not counted by the detection system as it is no longer inside the energy window. This results in a lost count because inside the time window the system detects only one photon, which is then discarded.

There are different techniques to work around those problems. For example the detection system can be optimized so that the detection time window is very tight. Additionally, the energy window can be optimized so that only photons with the correct energy are detected. Another way to reduce the effects of inaccurate measurement is to account for those problems during reconstruction. Multiple algorithms are available to correct for the effects of random, scattered and attenuated events.

2.2.3 2D and 3D Data Acquisition

There are two different acquisition modes in PET imaging. Older scanners often operate in 2D acquisition mode. In this mode septa made from lead or tungsten separate each crystal ring from one another, thus confining the axial angle of possible lines of response. Gamma rays that move towards the detection system at an oblique angle hit the septa and are absorbed. Only events that hit crystals on the same ring or on neighboring rings are counted. Those events follow a trajectory perpendicular or nearly perpendicular to the axial axis of the scanner. Figure 2.8 shows the different possible gamma ray trajectories for 2D acquisition mode (top) with the septa in place between the crystals and for 3D acquisition mode (bottom) without the septa.

2D acquisition is not optimal due to the aforementioned restrictions that essentially result in a loss of events. Only a very small fraction of all emitted events (about 0.5%) are detected. Removing the septa results in a lot more possible trajectories, more detected events and therefore a higher scanner sensitivity, but also a higher fraction of scattered events. Photons with flight paths oblique to the axial axis of the scanner are measured too. The benefit of this increase in sensitivity can be seen after reconstruction. Images reconstructed from 3D measurements show a statistically significant reduction in image noise compared to those reconstructed from 2D measurements [30]. 3D data files however can be multiple times larger than 2D data files as they contain a lot more different LORs. Reconstruction time is significantly influenced by the size of the input data, and fully 3D reconstruction adds a certain amount of complexity to the reconstruction algorithm and its implementation. Fully 3D image reconstruction can take up to an order of magnitude longer than 2D reconstruction.

Figure 2.8: Difference between 2D and 3D data acquisition mode

2.2.4 Optimization

To improve the sampling in axial direction, intermediate slices are created. They contain LORs from two neighboring crystal rings. Figure 2.9(a) shows that this results in 2 ∗ R_cr − 1 samples in axial direction, where R_cr is the number of crystal rings in the scanner. The dotted lines represent the additional gamma ray trajectories that are taken into account and the dashed lines visualize the intermediate LORs.

The number of parallel LORs that are accepted by the detection system is limited. The LORs at the edges of the gantry are not relevant because the subject is positioned in the middle. For example, as shown in Figure 2.9(b), the horizontal sinogram plane with θ = 0° contains LORs from 315° to 45°. This is a very effective method to remove unnecessary data in order to save memory and reduce the computational complexity of the reconstruction process. It also avoids bad parallax effects. The volume of all LORs that are not discarded forms the field of view (FOV) of the scanner.


Figure 2.9: Efficient data structuring. (a) Intermediate samples; (b) sinogram width and relevant LORs

2.3 Coordinate System, Data Structures and Compression

This section describes the coordinate system used for PET systems and two different ways to interpret the data. It explains how the essential data structures are constructed and how the data can be compressed for easier storage and faster image reconstruction.

2.3.1 Coordinate System

The coordinate system is defined so that every possible line of response can be described explicitly. There are 4 parameters describing a line of response: the inclination along the axial axis of the scanner φ, the projection ρ, which is the transaxial axis intercept, the azimuthal angle in the transaxial plane θ and the axial axis intercept z. Figure 2.10 illustrates this concept. While the range of θ is a full cycle, the axial inclination φ is only a couple of degrees depending on the depth of the scanner. This is the coordinate system for data measured in 3D acquisition mode. 2D data is missing the first parameter, the inclination along the axial scanner axis.

The sinogram can thus be seen as a four-dimensional array that contains in each of its elements the number of coincidence events that were detected along the corresponding LOR. One element of this array is called a bin. A sinogram view consists of all projections for a given projection angle θ.
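Viewed as a flat block of memory, such a four-dimensional sinogram can be addressed as in the following C++ sketch. The ordering of the coordinates (φ slowest, ρ fastest) is an assumption for illustration; actual file formats and the reconstruction software may order the dimensions differently.

#include <cstddef>
#include <vector>

// A 3D PET sinogram as a four-dimensional array of bins. Each bin holds the
// number of coincidences counted along the corresponding LOR.
struct Sinogram {
    std::size_t nPhi, nZ, nTheta, nRho;  // axial angle, axial intercept, azimuthal angle, projection
    std::vector<float> bins;

    Sinogram(std::size_t phi, std::size_t z, std::size_t theta, std::size_t rho)
        : nPhi(phi), nZ(z), nTheta(theta), nRho(rho),
          bins(phi * z * theta * rho, 0.0f) {}

    // Assumed memory layout: phi is the slowest and rho the fastest running index.
    float& at(std::size_t phi, std::size_t z, std::size_t theta, std::size_t rho) {
        return bins[((phi * nZ + z) * nTheta + theta) * nRho + rho];
    }
};

// A sinogram view is then the set of bins at(phi, z, theta, rho) for all phi,
// z and rho at one fixed angle theta.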

Figure 2.10: The PET coordinate system

The angle φ depends on the transaxial axis intercept due to the circular nature of the detector. LORs further away from the center have a steeper angle φ than LORs at the center. The LORs form a distorted plane. Figure 2.11 shows this issue in an exaggerated manner. This has to be considered when modeling the scanner geometry for the reconstruction process.

Figure 2.11: Oblique LORs are not parallel


2.3.2 Parallel Beam Space and Line of Response Space Sinograms

Projecting the arc-shaped crystal ring onto a plane results in bins of different width. The bin size depends on the radial coordinate, resulting in wider bins at the center of the projection and smaller bins at the edges. The sinogram with varying bin size is called a LOR space sinogram.

Figure 2.12: Comparison of LOR and PB space projection

A PB space sinogram can be calculated by applying radial arc correction with linear interpolation to the LOR space sinogram. The difference is illustrated in Figure 2.12. Parallel beam space sinograms are easier to operate on because a constant distance can be assumed between neighboring bins. However, as a result of the linear interpolation during radial arc correction the bins are no longer statistically independent of one another, which can cause problems during further reconstruction steps that require statistically independent data.
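The radial arc correction of a single projection row could look like the following C++ sketch. It only illustrates the linear interpolation step under simplified assumptions (precomputed radial positions of the LOR space bins, an equidistant target grid); it is not the resampling code actually used by the scanner software.

#include <cstddef>
#include <vector>

// Resample one projection row from LOR space (bins located at the
// non-equidistant radial positions lorPos, sorted ascending) onto a parallel
// beam grid of nPbBins equidistant bins using linear interpolation.
std::vector<float> arcCorrectRow(const std::vector<float>& lorBins,
                                 const std::vector<float>& lorPos,
                                 std::size_t nPbBins, float pbStart, float pbSpacing) {
    std::vector<float> pbBins(nPbBins, 0.0f);
    for (std::size_t i = 0; i < nPbBins; ++i) {
        const float r = pbStart + i * pbSpacing;   // radial position of PB bin i
        // find the LOR bin interval [j, j + 1] surrounding r
        std::size_t j = 0;
        while (j + 2 < lorPos.size() && lorPos[j + 1] < r) ++j;
        const float t = (r - lorPos[j]) / (lorPos[j + 1] - lorPos[j]);
        if (t < 0.0f || t > 1.0f) continue;        // outside the measured range
        pbBins[i] = (1.0f - t) * lorBins[j] + t * lorBins[j + 1];
    }
    return pbBins;
}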

2.3.3 Sinogram Compression

To reduce the required storage space and data transfer times and to simplify computations done on 3D sinograms, different methods are applied to compress the data. A commonly used compression method is spanning. The span concept is applied on the axial axis of the scanner and compresses direct (φ = 0) and oblique events (φ ≠ 0) by merging them into one single LOR. This is possible because the events are emitted from the same location, their trajectory just differs in φ. The number of events merged into one LOR defines the level of spanning. If for example 5 events from intermediate planes and 6 events from direct planes are merged, the dataset is called a span 11 sinogram. So the number of events merged is denominated by a number following the word span. Figure 2.13 illustrates this principle. The solid and dotted lines are merged into one LOR each, as they originate from approximately the same location. The spanning reduces the resolution at the edges of the FOV due to the information loss, but also reduces the data size. Therefore it is a trade-off and a small span is preferable. Siemens Healthcare PET systems mostly use span 11.

Figure 2.13: The span concept for spans 3, 5, 7 and 9

As there are LORs with a larger angle φ than are covered by the initial span, segments are created. Segments partition the LORs depending on the steepness of their angle φ. Segment 0 contains all LORs with an angle φ within span 11, segments −1 and 1 contain LORs with larger −φ and φ respectively, LORs with even larger φ are sorted into segments −2 and 2, and so on. The number of segments depends on the maximum ring difference, the largest ring difference that is allowed for a LOR. This also defines the maximum of φ. The segment concept is illustrated in Figure 2.14. Groups of LORs with steep angles are sorted into segment 1 and segment −1 respectively.

As a side effect of the compression, the axial resolution close to the border is reduced. In the border areas of the gantry the oblique events cannot be measured, as their trajectories only cross the detector system on one side. Therefore the number of z slices varies from segment to segment.


Figure 2.14: Segments 1, 0 and -1 for span 9 and mrd 13

For segment 0 the number of slices can be calculated by

slices(0) = 2 R_{cr} - 1 \qquad (2.7)

where R_{cr} is the number of crystal rings. This is true because of the intermediate samples that have been introduced between crystals. For all segments other than 0 the number of slices can be calculated by

slices(n) = 2 R_{cr} - 1 - 2 \left( -\frac{span - 1}{2} + |n| \cdot span \right) \qquad (2.8)

where the expression in brackets equals the minimum ring difference of the segment, which equals the number of missing slices on one end of the detector system. For each segment the angle φ can be calculated by

\phi(n) = \begin{cases} \tan^{-1}\!\left( \dfrac{n \cdot span \cdot z_{crystal}}{G_{radius}} \right) & \text{if } n \geq 0 \\ -\tan^{-1}\!\left( \dfrac{-n \cdot span \cdot z_{crystal}}{G_{radius}} \right) & \text{if } n < 0 \end{cases}

where z_{crystal} equals the size of one crystal in z direction and G_{radius} is the radius of the gantry including the depth of interaction of a gamma ray in the crystal. The depth of interaction is the distance a photon travels inside the detector crystal before it has released all its energy.
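As a check of equations 2.7 and 2.8, the following C++ sketch computes the number of slices per segment and the segment angle for a given configuration. The parameter names are illustrative; the values in main correspond to the 55 ring, span 11 scanner shown in Figure 2.15 and sum up to the 559 axial slices mentioned below.

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Number of sinogram slices in segment n (equations 2.7 and 2.8).
int slicesInSegment(int n, int rings, int span) {
    if (n == 0)
        return 2 * rings - 1;
    const int minRingDiff = -(span - 1) / 2 + std::abs(n) * span;  // bracket term of 2.8
    return 2 * rings - 1 - 2 * minRingDiff;
}

// Axial angle phi of segment n in radians. zCrystal is the crystal size in z
// direction, gRadius the gantry radius including the depth of interaction.
double segmentAngle(int n, int span, double zCrystal, double gRadius) {
    const double phi = std::atan(std::abs(n) * span * zCrystal / gRadius);
    return n >= 0 ? phi : -phi;
}

int main() {
    const int rings = 55, span = 11;  // configuration of the scanner in Figure 2.15
    int total = 0;
    for (int n = -3; n <= 3; ++n) {
        const int s = slicesInSegment(n, rings, span);
        total += s;
        std::printf("segment %2d: %3d slices\n", n, s);
    }
    std::printf("total axial slices: %d\n", total);  // prints 559
    return 0;
}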

A suitable way to visualize the span and segment configuration of a scanner is the michelogram, named after Christian Michel. In the diagram the horizontal axis shows the crystal rings on one side of the scanner and the vertical axis the rings on the other side of the scanner. A dot or an asterisk denotes that those two crystal rings form an allowed gamma ray trajectory (LOR). An empty bin specifies a forbidden connection of rings due to the limitation given by the maximum ring difference. Connected dots indicate that those rings together form one allowed sinogram plane. Figure 2.15 shows a michelogram of a PET system with 55 crystal rings, a maximum ring difference of 38 and 7 segments. The segments are the diagonally connected areas.

To compare the data size, a dataset produced by an older scanner with 32 crystal rings, a maximum ring difference of 22 and 5 segments would be about 75.62 MB large (assuming 239 z, 288 ρ and 288 θ samples and 4 bytes for each bin). A typical dataset from a modern scanner like the one shown in the michelogram would be 240.74 MB large (assuming 559 z, 336 ρ and 336 θ samples). Without the compression however (i.e. span 1) the dataset from that scanner would be roughly 1192.51 MB large, assuming the number of possible LORs in z direction to be 2769.

Figure 2.15: Michelogram of a 55 ring, 38mrd, span 11 3D PET system
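The quoted sizes follow directly from the bin counts; the small sketch below (a hypothetical helper, using 4 bytes per bin and 1MB = 2^20 bytes) reproduces them.

    #include <cstdio>

    // Size of a sinogram in MB given its sample counts and 4 bytes per bin.
    double sinogramSizeMB(long zSamples, long rhoSamples, long thetaSamples)
    {
        return zSamples * rhoSamples * thetaSamples * 4.0 / (1024.0 * 1024.0);
    }

    int main()
    {
        std::printf("old scanner : %.2f MB\n", sinogramSizeMB(239, 288, 288));   // ~75.62 MB
        std::printf("span 11     : %.2f MB\n", sinogramSizeMB(559, 336, 336));   // ~240.74 MB
        std::printf("span 1      : %.2f MB\n", sinogramSizeMB(2769, 336, 336));  // ~1192.51 MB
        return 0;
    }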


2.3.4 Other Data Structures

The image is a cuboid represented as a three-dimensional array with x, y and

z coordinates. x and y correspond to the transaxial axis intercept ρ and its

azimuthal angle θ and thus describe the transaxial plane of the scanner. z

directly corresponds to the axial axis intercept of the sinogram or the depth

coordinate of the scanner.

A circular mask is used in conjunction with the x -y plane of the image

to restrict algorithms to only access and process data that is relevant for

reconstruction. This is done to restrict the image to the area of the FOV

that is covered by all projection angles θ. It restricts the x coordinate of the

transaxial plane depending on the y coordinate. A two-dimensional array

stores xstart and xstop values for each possible y . So for a given y only x

coordinates between xstart and xstop should be accessed.
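A minimal sketch of such a mask structure is given below; the names and the circle-based initialisation are illustrative assumptions and not necessarily how the reconstruction system builds its mask.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Circular mask: for every y it stores the first and last x index to process.
    struct CircularMask
    {
        std::vector<int> xStart;
        std::vector<int> xStop;

        explicit CircularMask(int dim)   // dim = number of pixels in x and y
            : xStart(dim), xStop(dim)
        {
            const double c = (dim - 1) / 2.0;   // centre of the transaxial plane
            const double r = dim / 2.0;         // radius of the FOV in pixels
            for (int y = 0; y < dim; ++y)
            {
                double dy = y - c;
                double dx = std::sqrt(std::max(0.0, r * r - dy * dy));
                xStart[y] = static_cast<int>(std::ceil(c - dx));
                xStop[y]  = static_cast<int>(std::floor(c + dx));
            }
        }
    };

    // Usage: iterate only over masked pixels of one image plane.
    // for (int y = 0; y < dim; ++y)
    //     for (int x = mask.xStart[y]; x <= mask.xStop[y]; ++x)
    //         process(image[z][y][x]);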

2.4 Image Reconstruction

Image reconstruction is the process that covers all steps, methods and algo-

rithms that are necessary to generate an image from the scanner’s measure-

ments. It contains pre- and post-processing steps and correction methods, and
at the center of this process is the reconstruction algorithm. Two different

approaches to image reconstruction algorithms exist.

Analytic algorithms assume a continuous data sampling. The data is

discretized after reconstruction. The most important analytic algorithm is

filtered back projection. It has the advantage of being fast and allows
easy control of the spatial resolution and noise correlations. Its drawbacks
are reduced spatial resolution due to smoothing in the filtering step and higher

image noise. The reason for this is that the algorithm is based on an in-

accurate model of PET physics that was originally developed for computed

tomography. Despite those shortcomings it still is a relevant reconstruction

method nowadays. The filtered back projection algorithm consists of two

basic steps. First the scanner's measurements are filtered to enhance
the image, for example by reducing noise and increasing the contrast. After

filtering the data is backprojected into the image.

In contrast to the analytic methods stand the algebraic algorithms. They

depend on the discrete representation of input data and reconstructed image.

The reconstruction problem is described as a linear equation system


b = A ∗ x ⇒ x = A−1 ∗ b (2.9)

where x is the image, b is the sinogram and A is the geometry description

of the gantry. This linear equation system is solved with iterative optimiza-

tion algorithms since A−1 can’t be calculated directly. The most common

method used to solve this LES is the Maximum-likelihood Expectation-

Maximization algorithm (MLEM). One of the most important evolutions in

PET in the last years has been the increasing role of iterative methods. They

are based on a more accurate model of the image acquisition process and

therefore improve the quality of reconstructed images. The major drawback

of those methods is the higher computational complexity which reflects itself

in the time it takes to reconstruct an image. The runtime of an iterative

algorithm can be several orders of magnitude longer than that of analytic

methods.

Figure 2.16: Comparison of analytic and iterative reconstruction algorithms

Figure 2.16 shows the difference in image quality between two common

analytic and iterative image reconstruction algorithms. The top row shows

the image reconstructed with filtered back projection; reconstruction time

was 10 seconds. The bottom row shows the image reconstructed with the

unweighted ordered subset expectation maximization algorithm, reconstruc-

tion time was 170 seconds. Due to less noise in the image calculated with


an iterative method it is easier to read and interpret the image. The image

reconstructed with filtered back projection is noisy and less detail is visible.

The superior image quality achieved with iterative methods makes them

the algorithms of choice. Additionally filtered backprojection does not work

very well with low count rates. Due to the need for short scan times iterative

methods are preferred.

2.4.1 Basic Principles

The LORs contained in the sinogram can also be interpreted as projections.

The process of projecting is equivalent to mapping the positron emissions

towards an angle θ in a three dimensional subject onto a two dimensional

plane. The projection consists of a number of line integrals over emissions

with angle θ through the subject. The line integral represents only one frac-

tion of the positron emissions through the subject, namely all the emissions

in the direction of the integral. For this reason tomographic reconstruction

for PET differs from tomographic reconstruction for CT. In CT the entire

attenuation of the X-ray through the body is known for any given angle

θ. In PET only the counts that were emitted in θ direction are measured.

Therefore the mathematical model for CT does not fit PET imaging per-

fectly, however reasonably good images can be reconstructed by filtered back

projection derived from CT.

Figure 2.17 shows the basic principle of projection. The numbers visible

in the subject are the number of emissions in θ = 90◦ direction. They are

counted over time and represented as a pixel image and a continuous function.
This kind of projection is called forward projection as measurements
from the subject are projected forward onto a plane. During a PET scan

projections from all possible angles are generated by taking all emissions

at all angles into account. If for example an event at the angle θ = 36◦ is

measured, it is added to the equivalent sinogram view. Because the photons

are emitted at approximately 180◦ to each other, the sinogram view at angle θ
is equivalent to the view at θ + 180◦ and thus sinograms cover only

the range [0◦, 180◦).


Figure 2.17: Projection of positron emissions at θ = 90◦ into sinogram row

2.4.2 The Maximum-Likelihood Expectation Maximization

Algorithm

Today the Maximum-Likelihood Expectation Maximization (ML-EM) algo-

rithm is one of the most widely used iterative algorithms in PET. Different

versions of the algorithm were developed and various improvements to the

algorithm are in use. The basic principle however is the same. A statistical

algorithm for calculating the maximum likelihood from incomplete data was

published by Dempster et al. [13] in 1977. Five years later Shepp and Vardi [48]
developed a new, very accurate mathematical model of emission tomography.

Using this model they were able to use the EM algorithm for image recon-

struction. They showed that this algorithm reduced the statistical noise

artifact compared to algorithms in use at that time.

The input data of the algorithm consists of the emission sinogram m∗

and a 3D image λ. In the first iteration the image λ0 is homogeneous, for

example each voxel has the value 1. The algorithm calculates the most

likely image λ∗ that could have produced the emission sinogram at hand.

For that purpose it iteratively modifies the 3D image λn to approximate

it’s emission mn to the measured emission m∗. This is done by applying a

correction factor in each iteration. This factor is calculated by reprojecting

the quotient of the initial emission m∗ sinogram over the emission sinogram

mn that the current image produces. The informal equation 2.10 emphasizes

this principle for an iteration n.


λn+1 = λn ∗ back-projection(m∗ / forward-projection(λn)) (2.10)

The normalization factor in equation 2.11 is necessary after each iteration. It

divides the image by a summation of all coincidence lines. This compensates

for areas of the FOV that are sampled with differing numbers of LORs.

1 / back-projection(1) (2.11)

The formal termination condition is convergence of the solution. However

in practical applications it can either be a fixed number of iterations or

a quality factor of the reconstructed image. Sometimes visual judgment

of the image is called upon to determine when the reconstruction can be

stopped but PET system manufacturers also measure convergence behavior

and then decide on a recommended number of iterations.

Mathematical Model after Shepp and Vardi

This section reproduces the mathematical model of emission tomography

devised by Shepp and Vardi [48]. The image is seen as a density distribution

of emission events λ = λ(x , y , z ). The measured data is a function m∗(d)

representing the number of counts measured in each of D detector units d .

The problem is to estimate the density distribution λ from the measured

data m∗. The density distribution λ(x , y , z ) is discretized into boxes b =

1, . . . ,B giving an unknown number of counts m(b) in each box. So we

want to estimate λ(b) or guess the number of unobserved counts m(b) in

each box.

Shepp and Vardi model this by defining λ(b) as the integral of λ(x , y , z )

over box b and m(b) as a Poisson distributed number with mean λ(b) that

is generated independently in each box as:

P(m(b) = k) = e^{−λ(b)} λ(b)^k / k!,   k = 0, 1, . . . (2.12)

They assume a probability

p(b, d) = P(detected in d | emitted in b) (2.13)


and conclude that the probability of an emission in b being detected at all

is given by

p(b) = ∑_{d=1}^{D} p(b, d) ≤ 1. (2.14)

They further show that the density of the detected counts is equal to the den-

sity of emitted counts λ(x , y , z ) so that equality holds in 2.14. They note a

problem with this model regarding the discretization of λ(x , y , z ) into boxes

b = 1, . . . ,B . However they argue that it is an acceptable approximation

and more accurate compared to using transmission tomography models for

emission tomography. They also mention that their physics model is not

entirely accurate as it doesn’t take scatter, randoms and multiple emissions

into account.

ML-EM Algorithm for PET after Shepp and Vardi

From the above model Shepp and Vardi construct a likelihood function, cal-

culate the maximum likelihood and provide an iterative approach to max-

imization based on the EM algorithm. They give the likelihood function

by

L(λ) = P(m∗|λ) = ∑_A ∏_{b=1,...,B; d=1,...,D} e^{−λ(b,d)} λ(b, d)^{m(b,d)} / m(b, d)! (2.15)

which gives the likelihood that the emission density distribution λ leads to

measurements m∗. They simplify this equation by taking the logarithm of it.

This is possible because of the increasing nature of the logarithmic function

which does not affect the maximization [28].

l(λ) = logL(λ) (2.16)

They go on to find the maximum of this function and conclude by
providing an iterative method for maximizing the log-likelihood l(λ)

using the EM algorithm [13]:


λnew(b) = λold(b) ∑_{d=1}^{D} [ m∗(d) p(b, d) / ∑_{b′=1}^{B} λold(b′) p(b′, d) ]. (2.17)

λ is defined as the maximized probability distribution and p(b, d) can be

seen as the fraction that b contributes to the projection ray described by d .

create initial image λ0 filled with ones
load emission measurements m∗
n = 0
loop until image converges (λn ≈ λn−1 ≈ λ∗) or another termination condition is reached
    calculate forward projection mn of current image λn
    divide emission by forward projection mn = m∗ / mn
    backproject the quotient mn into λtmp
    calculate next image λn+1 = λn ∗ λtmp
    n = n + 1

Figure 2.18: Basic ML-EM algorithm

An implementation of the maximum likelihood expectation maximiza-

tion algorithm would approximate the image λn over n iteration steps until

the result converges. In every iteration the image λn is forward-projected

into mn and then m∗ divided by mn is back projected. Figure 2.18 shows

this algorithm in pseudo-code.
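The loop of Figure 2.18 can be sketched in C++ as follows. The projector interfaces and the flat vector containers are assumptions made for illustration; the sketch only mirrors the structure of equations 2.10 and 2.11 and is not the implementation used in this work.

    #include <cstddef>
    #include <vector>

    using Volume   = std::vector<float>;  // image lambda, flattened
    using Sinogram = std::vector<float>;  // emission data, flattened

    // Assumed projector interfaces (declarations only, for illustration).
    Sinogram forwardProject(const Volume& lambda);
    Volume   backProject(const Sinogram& s);

    Volume mlem(const Sinogram& measured, Volume lambda, int iterations)
    {
        // Normalization image: back-projection of a sinogram of ones (eq. 2.11).
        Volume norm = backProject(Sinogram(measured.size(), 1.0f));

        for (int n = 0; n < iterations; ++n)
        {
            Sinogram estimate = forwardProject(lambda);          // m_n
            for (std::size_t i = 0; i < estimate.size(); ++i)    // m* / m_n
                estimate[i] = estimate[i] > 0.0f ? measured[i] / estimate[i] : 0.0f;

            Volume correction = backProject(estimate);           // lambda_tmp
            for (std::size_t j = 0; j < lambda.size(); ++j)      // eq. 2.10 with normalization
                lambda[j] *= norm[j] > 0.0f ? correction[j] / norm[j] : 0.0f;
        }
        return lambda;
    }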

OS-EM Algorithm after Hudson and Larkin

The ML-EM algorithm described above requires many iterations until the

image converges. Each iteration requires a substantial amount of computa-

tion namely the forward- and back-projection of the entire dataset. In 1994

Hudson and Larkin [18] improved the ML-EM algorithm to increase the rate

of convergence thus reducing the algorithm’s complexity.

They introduce subsets and specify an iteration as one pass through all

subsets. Subsets are a means of partitioning the entire dataset such that
mutually different information is introduced into the calculation as early
as possible. Different information in essence means projections from

angles as far apart as possible.


Figure 2.19 shows the modification of the algorithm. Whereas the ML-

EM uses the entire dataset in each iteration and calculates the correction

factor λtmp for it, the OS-EM algorithm calculates the correction factor

for each subset. One subset contains much less data than the entire set,
and the projection steps are far more expensive than the division m∗/mn and
the multiplication λn ∗ λtmp. Hudson and Larkin show that with this

modification to ML-EM a speedup of one order of magnitude is possible.

The speedup originates from the fact that less data has to be forward- and

back projected in one subset compared to one iteration of the original ML-

EM. At the same time the image approximated with one subset calculation

in OS-EM roughly corresponds to the image approximated with one full

iteration of ML-EM. So if 8 subsets are used, the image converges 8 times

faster when compared to regular ML-EM.

create initial image λ0 filled with ones
load emission measurements m∗
partition emission measurements m∗ into s subsets over angle θ
n = 0
for(i=0; i < iterations; i++)
    for(si=0; si < s; si++)
        calculate forward projection mn of current image λn for subset si
        divide emission by forward projection mn = m∗ / mn
        backproject the quotient mn into λtmp
        calculate next image λn+1 = λn ∗ λtmp
        n = n + 1

Figure 2.19: Basic OS-EM algorithm

It is practical to create the subsets at a view level where one view v

contains all projections with the same azimuthal angle θ. For this reason

the number of subsets s must divide the number of views #θ in the sinogram

(2.18).

#θ mod s = 0 (2.18)

Equation 2.19 shows how the set of views sx belonging to one subset x

can be calculated where #θ/s is the number of views per subset. Subsets


are enumerated within {1, . . . , s}.

sx = ⋃_{i=1}^{#θ/s} { (i − 1) ∗ s + x − 1 } (2.19)
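A small sketch of this partitioning is shown below (hypothetical helper; subset numbers start at 1 as in the text, view indices at 0). It corresponds to equation 2.19.

    #include <vector>

    // Views belonging to subset x (1 <= x <= s) for a sinogram with numViews views.
    // The views of one subset are spaced s apart so that they cover the full
    // angular range.
    std::vector<int> subsetViews(int x, int s, int numViews)
    {
        std::vector<int> views;
        for (int i = 1; i <= numViews / s; ++i)
            views.push_back((i - 1) * s + x - 1);
        return views;
    }

    // Example: 336 views and 8 subsets give 42 views per subset;
    // subset 1 then contains the views 0, 8, 16, ..., 328.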


Chapter 3

Hybrid implementation of

PET image reconstruction

3.1 Projectors

The projectors are algorithms to project the sinogram into the image (back-

projection) or to project the image into the sinogram (forward-projection).

Forward-projection is essentially what happens during the data acquisition

process. During the iterative reconstruction process the data has to be back-

and forward-projected multiple times. Because of that the projectors are an

integral part and at the same time the most time-consuming part of the recon-

struction process. About 90% of the reconstruction time is spent calculating

forward- and backward-projections (not counting correction methods for e.g.

randoms or attenuation and other pre- and post processing steps). So when

thinking about speeding up the reconstruction process the projection algo-

rithms are the first thing to look at.

3.1.1 Ray projection through pixel images

The projectors are based on an algorithm called Joseph’s Method derived

from a paper published by Peter Joseph in 1982 [26]. In the paper the two -

at that time state of the art - techniques for reprojecting rays through pixel

images are discussed and a new, more accurate method is introduced. The

algorithm is explained for two dimensional space however the method can

be easily expanded to three dimensions. Joseph defines rays as straight lines


y(x ) = − cot (ω)x + y0 or (3.1)

x (y) = − tan (ω)y + x0 (3.2)

with ω representing the angle between x -axis and ray. It is assumed that

the image f (x , y) is a smooth function. Joseph defines the

line integral as an integral over x or y depending on ω:

if |sin(ω)| ≥ 1/√2:   integral in x direction (3.3)
if |cos(ω)| ≥ 1/√2:   integral in y direction (3.4)

which means that for flat angles between 45◦ and 135◦ and −45◦ and −135◦

the line integral is

(1 / |sin(ω)|) ∫ f(x, y(x)) dx (3.5)

and for all other angles it is

(1 / |cos(ω)|) ∫ f(x(y), y) dy. (3.6)

Joseph continues by approximating the respective integral by a Riemann

sum. Due to the case differentiation for ω, a ray never hits more than 2

pixels per row. Joseph calculates the exact location of the ray hitting the

respective row and calculates the linear interpolation between them. The

line integral is then the sum of the interpolations of each row multiplied

by the scaling factor. Assuming the ray hits column n at y(xn) which is

somewhere between pixel m and m + 1 the linear interpolation between the

two pixels Pn,m and Pn,m+1 is calculated by:

Plerp = w ∗ Pn,m+1 + (1 − w) ∗ Pn,m (3.7)

with w being the non-integer part of y(xn)

w = y(xn) − ⌊y(xn)⌋. (3.8)
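For the x-driven case one interpolation step of Joseph's method can be sketched as follows; the image layout and names are illustrative, and the real projector additionally handles the 3D case and the shearing described in the next section.

    #include <cmath>
    #include <vector>

    // One step of Joseph's method for a ray integrated along x:
    // the ray crosses column n at height y(x_n); the contribution of that column
    // is the linear interpolation between the two neighbouring pixels (eq. 3.7, 3.8).
    float josephSample(const std::vector<float>& image, int width,
                       int n, float yAtXn)
    {
        int   m = static_cast<int>(std::floor(yAtXn));  // lower pixel row
        float w = yAtXn - m;                            // non-integer part of y(x_n)
        float lower = image[m * width + n];             // P_{n,m}
        float upper = image[(m + 1) * width + n];       // P_{n,m+1}
        return w * upper + (1.0f - w) * lower;
    }

    // The line integral is the sum of these samples over all columns,
    // multiplied by the scaling factor 1/|sin(omega)|.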


3.1.2 Projector Algorithm

The projector algorithm used in this work is based on the sheared projector

by Panin and Kehren [43], which uses Joseph's method for projecting rays
through pixel images. They present a very efficient algorithm for back- and

forward-projection of LOR based 3D sinograms utilizing simple shearing

operations. The algorithm is inspired by the paper ”A fast algorithm for

general raster rotation” by Alan W. Paeth [42] which presented an efficient

way to rotate raster images. To calculate the rotation, the rotation matrix is

separated into shear matrices that are applied consecutively. The algorithm

presented here exploits all the advantages of the Rotate- and Slant algorithm

but uses Joseph’s Method for the actual interpolation.

The idea presented in [42] can be leveraged in 3D projector algorithms

as the basic operation of the projector is to rotate a 3D image volume. To

project the bins of segment 0 with φ = 0◦ a linear interpolation has to be

calculated. The image is only rotated by θ around one single axis. For the

other segments however a bilinear interpolation has to be calculated as both

angles θ and φ are ≠ 0.

The algorithm utilizes the fact that the bilinear interpolation for the

oblique segments can be separated into two linear interpolations one for θ

and one for φ. The linear interpolation for θ is the same for an entire view

independent of φ. So when rotating the image around the axial axis to

calculate segment 0 an intermediate image called the sheared image is saved

and used again for the other segments of this view. As a result only the

linear interpolation has to be calculated for every φ using this sheared image.

Redundant calculations are eliminated and the computational complexity is

reduced. Figure 3.1 illustrates this principle. The first image shows the

projection of segment zero and the calculation of the sheared image at the

same time. The next image shows how the sheared image and segment zero

of the view relate to each other and on the left side an oblique segment is

calculated utilizing the sheared image.

Assuming y(xn) and x (ym) are precomputed and there are four times

more oblique events in the sinogram than straight LORs, the number of
operations is reduced by about 45% compared to a method that calculates both

linear and bilinear interpolations.

Back- and forward-projectors are basically the same operations just in

reverse order. The forward-projection algorithm’s input is the 3D image


Figure 3.1: Projector using sheared image

volume. The segment 0 part of the sinogram view is calculated and at

the same time the sheared image is stored. Using this sheared image the

other segments of the sinogram view are calculated. The sinogram is then

scaled by the corresponding factor as described in section 3.1.1 Joseph’s

Method. The back-projector works the other way around. Its input is the

sinogram which is scaled after Joseph’s Method. Secondly its segments 6= 0

are projected into the image which results in the sheared image. Then the

image is unsheared and segment 0 is projected.

3.1.3 Implementation

The projectors are implemented in C++ optimized for multi-core and multi-

processor systems. They utilize parallel programming paradigms on two

levels.

On the lowest level, the single instruction multiple data (SIMD) princi-

ple as it is common in modern x86 architectures is used. The SSE extension

in x86 CPUs allows programmers to use vector based operations. SSE ca-

pable processors can operate on up to four 32bit floating point values at the

same time with one single instruction. The extension allows the usage of

additional data types, namely __m64 (MMX), __m128 (SSE1) and __m128d
as well as __m128i (SSE2). __m64 is a 64 bit wide variable that can store
multiple data types (8x8bit, 4x16bit etc.). The most interesting data type is
__m128 as it can store four 32bit floating point values. One prerequisite to


using __m128 is to align the data in memory to 16 bytes. This is because the
_mm_load and _mm_store intrinsics expect the last 4 bits of the address to
be zeros. This can be achieved by using the functions in listing 3.1 instead
of malloc and free. _mm_malloc takes an additional parameter to specify the
alignment boundary. This is documented in the Intel C++ Intrinsics Reference
[20].

    void* _mm_malloc(size_t size, size_t align);
    void  _mm_free(void* p);

Listing 3.1: SSE memory allocation

As a consequence apart from using special functions to allocate and free

memory, the data arrays, in particular the sinogram and the images, have
to be padded so that their size is a multiple of 16. This is done by

simply increasing the z coordinate of sinograms and images. The padding

introduces an overhead, which is outweighed by the considerable speedup

gained from using SSE. Instead of loading every single floating point variable

and operating on it, four variables can be loaded at a time and four variables

can be operated on in parallel.
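The following sketch illustrates the resulting pattern on a simple operation, scaling a padded, 16 byte aligned array. It uses the standard SSE intrinsics and is not taken from the reconstruction code itself.

    #include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_load_ps, ...

    // Multiply n floats by a factor, four values per instruction.
    // data must come from _mm_malloc(..., 16) and n must be padded to a multiple of 4.
    void scaleSSE(float* data, int n, float factor)
    {
        __m128 f = _mm_set1_ps(factor);            // broadcast factor to all 4 lanes
        for (int i = 0; i < n; i += 4)
        {
            __m128 v = _mm_load_ps(data + i);      // aligned load of 4 floats
            _mm_store_ps(data + i, _mm_mul_ps(v, f));
        }
    }

    // Allocation with the required 16 byte alignment:
    // float* img = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    // scaleSSE(img, n, 0.5f);
    // _mm_free(img);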

The second parallel programming paradigm used in the projector’s im-

plementation is multithreading. The algorithm creates multiple threads and

calculates the projections in parallel. In this particular implementation a

view based parallelism is used, so that multiple sinogram views are pro-

jected at the same time. This means that the inner loop of the OSEM

algorithm is parallelized (compare Figure 2.19) to speed up the calculation

of one subset. During forward-projection every thread accesses the same

image to calculate segment 0 of the sinogram. At the same time every

thread saves its own sheared image because every thread calculates a dif-

ferent view angle θ. Then every thread calculates the other segments using

its sheared image. The sheared image is discarded after projection. During

backward-projection every thread projects the segments ≠ 0 into its own

sheared image and unshears it while projecting segment 0. After all views

are projected, the resulting images from each thread have to be added up

resulting in the final image.

Additional optimization is applied to the algorithm as described in [27].

The most important is the index reshuffling of the image. While the index


order normally would be x , y and z , the projectors use the z , x , y ordering

scheme. This allows the projectors to access memory in a contiguous way

because the innermost projector loop iterates over z . With index order zxy

this loop can access elements right next to each other in memory.
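The two orderings differ only in how the flat offset is computed; a minimal sketch (dimension names assumed) makes the difference explicit.

    // Conventional x, y, z order: neighbouring z elements are dimX*dimY apart.
    inline int indexXYZ(int x, int y, int z, int dimX, int dimY)
    {
        return z * dimX * dimY + y * dimX + x;
    }

    // Reshuffled z, x, y order used by the projectors: the innermost loop over z
    // touches consecutive memory locations.
    inline int indexZXY(int x, int y, int z, int dimZ, int dimX)
    {
        return y * dimX * dimZ + x * dimZ + z;
    }

    // for (int y = ...) for (int x = ...) for (int z = ...)  // innermost loop over z
    //     image[indexZXY(x, y, z, dimZ, dimX)] += value;     // contiguous access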

3.2 Optimization of CPU projector

A significant drawback of the CPU projector implementation described in

3.1.3 can be found in the back-projection part of the algorithm. For each

used thread a separate image is required. After projection, the images of

all threads have to be summed up. This can become a scalability problem as

CPUs today contain more and more cores and therefore more images would

have to be added up.

3.2.1 Analysis of current Implementation

An analysis of the existing implementation demonstrates this issue. The

algorithm is executed with multiple numbers of threads (1-16) on two dif-

ferent systems. A 2x dual core system comprised of two dual core 2.66GHz

Intel Xeon processors with a bus speed of 1333MHz and 4MB cache and a

2x quad core system with two quad core 2.66GHz Intel Xeon processors also

with a bus speed of 1333MHz and 8MB cache.

Figure 3.2: CPU projector efficiency analysis. (a) Runtime in seconds and (b) parallel efficiency, plotted against the number of threads for the 2x dual core and 2x quad core systems.

Figure 3.2(a) shows the reconstruction runtime plotted against the num-

ber of threads running for both the dual core and the quad core systems.

Both systems are fastest when 4 threads are used for reconstruction. Addi-

tionally, the dual core system's overall performance is slightly better than


that of the quad core system. Similarly, when plotting the parallel efficiency

E = (1/P) ∗ Tseq / T(P) (3.9)

where P is the number of threads, Tseq is the sequential time (1 thread) and

T(P) is the runtime using P threads, as in Figure 3.2(b), the efficiency drops

below 0.5 when more than 4 threads are used. The efficiency is also very

similar for both systems, which comes as a surprise as the quad core system

should in theory be able to execute twice as many instructions as the dual

core system.

The reason for this is that the algorithm is inherently memory bound.

There are relatively few computations per memory access and large amounts

of data have to be accessed. Each thread has to iterate over an entire image

and access its elements. Memory access of different threads is thus scat-

tered across system memory which inhibits optimization mechanisms like

caching and fast consecutive memory access. Additionally if more threads

are used more memory is used. The exponential decrease in efficiency (or

the exponential increase in parallel overhead) can be attributed to context

switching between threads and the scattered memory access caused by using

a separate image for each thread and the overloading of the memory
bus. Furthermore, summing all per-thread images after back-projection adds extra

overhead.

3.2.2 Optimization of current Implementation

The algorithm can be optimized by a more fine-grained domain decompo-

sition. The current implementation parallelizes the loop over the different

view angles θ. Each thread calculates its own view. One level below the

view-loop is the loop over the integration direction (compare subsection

3.1.1). So the idea is to let every thread calculate the same view at the

same time, but each thread calculates only a small part of the image. Figure

3.3 shows the algorithm for back projecting one sinogram view. The opti-

mized algorithm would parallelize the outer loops over x . Thread Tn would

loop x from n ∗ ρ-samples/#T to (n + 1) ∗ ρ-samples/#T − 1 where #T

is the number of threads. The problem set is 5 dimensional: view , x , y ,

segment , z and instead of spreading the highest dimension view over multi-

ple threads the integration direction being the second highest dimension is


decomposed. The advantage is that each thread can operate on the same

image which draws memory access closer together and therefore improves

cache usage and also avoids the summation process of each thread’s images

after all views are projected. Additionally some factors are the same for an

entire view calculation and thus can be shared by all threads.
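A sketch of this decomposition is given below; thread handling and the actual interpolation are omitted and the names are illustrative. Each thread processes its own x range of the one image shared by all threads while all threads work on the same view.

    // Fine-grained decomposition of the back-projection of one view:
    // thread n of numThreads processes only the x range [xBegin, xEnd).
    void backProjectViewPart(int n, int numThreads,
                             int rhoSamples, int segments, int zSamples)
    {
        int chunk  = rhoSamples / numThreads;
        int xBegin = n * chunk;
        int xEnd   = (n == numThreads - 1) ? rhoSamples : xBegin + chunk;

        for (int x = xBegin; x < xEnd; ++x)          // per-thread slice of the image
            for (int y = 0; y < rhoSamples; ++y)
                for (int segment = 1; segment < segments; ++segment)
                    for (int z = 0; z < zSamples; ++z)
                    {
                        // calculate linear interpolation for the oblique segments,
                        // writing into the one image shared by all threads
                    }
    }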

This new method was implemented and runtime tests were performed

to evaluate the efficiency gained from the modification. Figure 3.4(a) com-

pares the runtime of the two algorithm implementations on the dual core

system and 3.4(b) shows the equivalent data for the quad core system. For

both systems the runtime decreases with an increasing number of threads.

On the dual core system reconstruction takes about 143 seconds with 8

threads using the original algorithm implementation but takes only 113 sec-

onds with 8 threads using the optimized implementation.

Overall the performance increase gained from adding more threads is not

significant. This is independent of the number of CPU cores available in the

system. The reason for this is found in the memory bound nature of the

algorithm and also having more threads than cores does not make a lot of

sense due to computation time that is lost in the increased number of task-

switches. However when looking at the two different implementations it is

important to determine on what level the algorithm should be parallelized.

It affects how well the underlying hardware is able to optimize execution as

well as the magnitude of the parallel overhead.

Joseph's method scaling
for(x=0; x < ρ-samples; x++)
    for(y=0; y < ρ-samples; y++)
        for(segment=1; segment < segments; segment++)
            for(z=0; z < z-samples; z++)
                calculate linear interpolation for oblique segments
sheared image calculated, now unshear and project segment 0
for(x=0; x < ρ-samples; x++)
    for(y=0; y < ρ-samples; y++)
        for(z=0; z < z-samples; z++)
            calculate linear interpolation for segment 0 and un-shear

Figure 3.3: Back-projector structogram


Figure 3.4: Comparison of original and modified implementation. (a) Runtime on the 2x dual core system and (b) runtime on the 2x quad core system, plotted against the number of threads for the original and the modified implementation.

3.3 The Compute Unified Device Architecture

The Compute Unified Device Architecture (CUDA) is a programming frame-

work based upon a new generation of graphics cards produced by Nvidia.

The framework makes it easy for programmers to leverage the power of mod-

ern graphics devices for general purpose computation on GPUs (GPGPU)

without having to understand the graphics pipeline. A paradigm shift in the

design of graphics processing units has made this possible. Prior to CUDA,

graphics cards were designed as special purpose processors - they contained

many small special purpose functional units that all performed a dedicated

task. Those functional units were chained together to form the graphics

pipeline. With new generation GPUs this design is superseded by a more

generalized approach. Instead of dedicated elements that are designed to

only solve one specific task, graphics cards today consist of general purpose

functional units that can be ”dynamically allocated to vertex, pixel, geom-

etry or physics operation” [38] to form the traditional graphics pipeline.

This is important for backward compatibility. Important for general pur-

pose computing on the GPU is the fact that those functional units can not

only be programmed to act as the traditional graphics pipeline but to solve

virtually any existing data parallel problem. This could not be done in an

easy way before as older graphics cards had a graphics-specific instruction

set, could only perform gather and no scatter memory operations and were

in general limited by the graphics pipeline.

Besides Nvidia, many other companies are working on similar tech-

nologies. ATI/AMD offered ”CTM - Close to Metal” which is a GPGPU


interface for their ATI graphics cards [2]. There’s also an open source

framework called BrookGPU that ”abstracts and virtualizes many aspects

of graphics hardware” [7]. Additionally there are a number of commer-

cial products on the market that focus on virtualizing a number of parallel

hardware architectures including GPUs, the Cell processor and multicore

CPUs for high-performance computing (HPC). In 2008 the Khronos Com-

pute Working Group was formed to standardize general purpose parallel

programming. The work group released the OpenCL 1.0 standard by the

end of 2008 [34]. It is very similar to the ideas behind CUDA as Nvidia was

involved in the creation of the standard but incorporates both CPUs and

GPUs in its architecture.

The following sections discuss the hardware architecture of the CUDA

capable Nvidia G80 graphics chip in more detail. This chip is used in Nvidia

Geforce 8800 and Nvidia Tesla products, the devices that were used in this

work.

3.3.1 Hardware Architecture

Producers of graphics processors tend to be secretive about implementation

details of their products. However to write efficient programs for a given

platform one has to have a good understanding of the underlying hardware

architecture. Information about it can be found in the Nvidia Geforce 8800

GPU Architecture Overview [38], the CUDA Programming Guide [39] and

the course material from a CUDA course at University of Illinois [19]. Self-

conducted experiments helped to clarify certain aspects of the architecture.

One Geforce 8800 chip can be seen as a streaming processor array (SPA).

Figure 3.5 shows the principal schematics of the 8800 chip. It contains a

number of texture processing clusters (TPC) which contain texture units

(TEX) and streaming multiprocessors (SM). The streaming multiproces-

sors consist of stream processors (SP) and super function units (SFU). In-

struction fetch and decode units are part of the streaming multiprocessor.

Therefore all stream processors on a multiprocessor execute always the same

instruction. Branching on a multiprocessor is possible by disabling unnec-

essary stream processors. A Geforce 8800 Ultra chip contains 8 TPCs with

2 SMs each and each SM contains 8 SPs resulting in 128 (8 ∗ 2 ∗ 8) stream

processors running at 750MHz. Table 3.1 gives an overview of the different

kinds of memories that are available at different levels of the architecture.


Figure 3.5: Geforce 8800 architecture

Memory     Location   Cache   Access   Scope         Latency
Register   On Chip    N/A     RW       1 Thread      1 Cycle
Shared     On Chip    N/A     RW       Block         1 Cycle
Local      Off Chip   No      RW       1 Thread      400 - 600 Cycles
Global     Off Chip   No      RW       All Threads   400 - 600 Cycles
Constant   Off Chip   Yes     R        All Threads   DRAM, cached
Texture    Off Chip   Yes     R        All Threads   DRAM, cached

Table 3.1: CUDA Memory Overview

Every streaming multiprocessor has an on-chip register set with 8192

32bit registers. The shared memory is 16kB on-chip memory per multipro-

cessor divided into 16 banks for CUDA 1.0 devices. It is one of the most

important advantages over old GPU architectures for GPGPU as this mem-

ory represents a programmable cache and when used efficiently can result in

a huge memory access speedup. Local memory is comparatively slow mem-

ory that resides on off-chip DRAM and is only used if not enough registers

are available. Global memory is the main device memory and can be up to

many hundred megabytes large. It is not cached so access is rather costly.

Constant memory is a read only section of the DRAM with a maximum size

of 64kB. Each multiprocessor has a cache working set of 8kB for constant

memory. Texture memory is read only memory of arbitrary size allocated


from global memory. It is accessed with texture units which allow special

addressing modes and filtering. Each multiprocessor contains an 8kB cache

working set for texture memory. Instruction memory is not visible to the

programmer but it is implemented as cached DRAM. In the case of the

Geforce 8800 Ultra the global memory is GDDR3 partitioned in 6 parts and

each partition provides a 64bit interface yielding a 384bit combined inter-

face. The memory clock on Ultra cards is 1080MHz the memory of the GTX

is clocked at 900MHz. The GPU is connected to the host system via PCIe

1.1 16x which allows data transfer rates of up to 4000MB/s.

3.3.2 Programming Model

The CUDA programming model provides means of partitioning a given data

parallel problem so that it can be computed efficiently on a CUDA device.

The problem as a whole is seen as the grid. The grid contains blocks and

each block contains threads. Figure 3.6 visualizes this concept.

Figure 3.6: CUDA programming model

A thread can be understood as a thread of execution as is known in

modern operating systems. However CUDA threads are lightweight, they

have almost no overhead, switching between them is very cheap and there

is no need for a stack. A thread is identified by its thread ID inside a

thread block. A thread block is a one-, two- or three-dimensional array


void mulVolCPU(float* v, float f)
{
    for (unsigned short z = 0; z < 8; z++)
        for (unsigned short y = 0; y < 8; y++)
            for (unsigned short x = 0; x < 8; x++)
                v[z*8*8 + y*8 + x] *= f;
}

Listing 3.2: Multiply volume CPU code

of threads. Threads in a block can share data through the aforementioned

shared memory and can synchronize their execution through synchronization

points. The number of threads in a block is limited to 512. However blocks

of the same size can be batched together into a grid of blocks. Threads

from different blocks cannot communicate or synchronize with each other.

A block is identified inside the grid by a block ID. A grid can also be a two

dimensional array as seen in figure 3.6. The program running on the GPU

is called a kernel. The programmer can choose grid dimensions and specify

the block size and inside the kernel each thread can be identified by block

and thread ID. This programming model is called ’parallel thread execution’

or PTX. It is also a generic virtual instruction set and virtual machine for

data parallel problems. The PTX is described in detail in [40].

A code example makes the programming principle clearer. Consider a

volume v comprised of floating point values implemented as a 3 dimensional

array with each dimension having the size 8. The challenge is now to effi-

ciently multiply each element of the array with a factor f. Listing 3.2 gives

an example implementation of how to solve this problem in C.

The function mulVolCPU takes a pointer to the volume and a factor as

argument. Inside the function 3 nested loops iterate over every dimension

of the volume. The dimensions are defined as x , y and z. Each voxel

is addressed consecutively and multiplied by the specified factor. There is

room to optimize this code by using vectors instead of scalars utilizing single

instruction multiple data processor extensions and a multithreaded approach

could further speed up the solution if multiple CPU cores are available. For

the sake of clarity and simplicity it is left that way.

Listing 3.3 shows how this problem could be solved using CUDA and a

modern Nvidia graphics card. As in the CPU case a function mulVolGPU

is given that takes a pointer to the volume and a factor as arguments. It is


 1  void mulVolGPU(float* v, float f)
 2  {
 3      dim3 grid(8);
 4      dim3 threads(8);
 5      mulVolKernel<<<grid, threads>>>(v, f);
 6  }
 7
 8  __global__ void mulVolKernel(float* v, float f)
 9  {
10      unsigned short z = blockIdx.x;
11      unsigned short x = threadIdx.x;
12      for (unsigned short y = 0; y < 8; y++)
13          v[z*8*8 + y*8 + x] *= f;
14  }

Listing 3.3: Multiply volume GPU code

assumed that v is a pointer to the volume residing in GPU global memory.
In lines 3 and 4 the size of the grid is defined: 8 blocks with each block

containing 8 threads. Then the kernel mulVolKernel() is started - the grid

dimensions are assigned in the angle brackets and the parameters are passed

in parentheses.

What happens now is that in each block 8 threads and an overall of 8

blocks are started yielding 8 ∗ 8 = 64 threads. Those threads are spread

across all multiprocessors and essentially run at the same time. Each thread

will now define two variables z and x. z is set to blockIdx.x, a predefined
variable specifying the ID of the block the thread runs in, and x is set to
threadIdx.x, the ID of the thread inside the block. Those variables are now

used as the z and x coordinate to address the voxels. The following loop

iterates only over y and multiplies each voxel with the factor as z and x

are defined implicitly by block and thread IDs. In essence on the GPU 64

threads run in parallel and multiply 64 voxels at the same time compared to

one voxel at a time for the CPU example. The programmer uses the block

and thread IDs to select and address the memory each thread works on.

In this respect CUDA is different from parallel programming languages such as

High Performance Fortran [31]. In HPF the programmer is able to specify

how to distribute memory in a top down manner using its DISTRIBUTE

directive whereas in CUDA the various threads access their data as they need

it and has therefore more of a bottom-up approach. Data distribution is not

specified explicitly and allows for more dynamic memory access. In CUDA


multithreading is implemented directly as each kernel is started multiple

times in various threads depending on the grid parameters.

3.3.3 Execution Model

A thread execution manager is responsible for generating threads and grid

blocks based upon the parameters specified in the kernel calls. Thread

blocks are then serially distributed to the multiprocessors. Each block is

guaranteed to execute entirely on one multiprocessor so that the shared

memory space resides on the same physical chip to guarantee fast memory

access. Depending on the number of registers and the size of the shared

memory space one thread block requires, it is possible to execute up to

8 thread blocks or 768 threads at a time on one streaming multiprocessor.

This is achieved by assigning every thread block time slices. If thread blocks

are finished and terminate, the thread execution manager can schedule other

blocks to run on the multiprocessor. The granularity of scheduling is defined

by warps - they are the scheduling units of the multiprocessors. A warp is

a batch of threads belonging to the same block. The warp size of the G80

chip is 32 threads. Warps are executed in a SIMD fashion meaning every

thread in a warp executes the same instruction but operates on different

data. As a streaming multiprocessor contains 8 stream processors, it takes

4 clock cycles to dispatch the same instruction to all threads in a warp.

The streaming multiprocessor maintains the thread IDs and schedules the

thread execution. Scheduling operations work as follows: the streaming

multiprocessor fetches a warp instruction from the instruction cache and

places it into a free instruction buffer slot. A scoreboarding mechanism

identifies instructions in the buffer that are ready to run. A warp is ready

to run if all its required values are deposited in registers. The warps are

then scheduled with a round robin method whereas the scoreboarding mech-

anism prevents any hazards that might occur when reordering instructions

by assigning instruction priorities. A decoupling of memory and processor

pipeline is achieved by this execution model. The memory pipeline can work

on fetching values for a warp while the processor pipeline executes instruc-

tions of another warp. So it is in fact important to launch more threads than

available stream processors as this makes it possible to hide memory access

latency. To quantify how well memory access latency can be hidden the

occupancy of a kernel can be calculated. It is defined as the ratio of concur-


rent threads for a kernel on one streaming multiprocessor to the maximum

number of threads supported by one streaming multiprocessor. The number

of concurrent threads is a function of the number of threads in one thread

block because each multiprocessor has a limitation on the number of warps it

can execute, the maximum active registers per thread and the shared mem-

ory requirements of a thread block. For memory bound kernels, increasing

the occupancy can speed up execution by hiding memory access latency.

However for computation bound kernels forcefully increasing occupancy can

result in reduced performance due to side effects such as register spilling to

off-chip memory.
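With the G80 limits quoted earlier (768 threads and 8 blocks per multiprocessor, 8192 registers, 16kB shared memory) the occupancy can be estimated as in the following simplified sketch, which ignores warp granularity and allocation rounding.

    #include <algorithm>

    // Rough occupancy estimate for one kernel configuration on a G80-class device.
    float occupancy(int threadsPerBlock, int registersPerThread, int sharedMemPerBlock)
    {
        const int maxThreadsPerSM = 768;
        const int maxBlocksPerSM  = 8;
        const int registersPerSM  = 8192;
        const int sharedMemPerSM  = 16 * 1024;

        int byThreads   = maxThreadsPerSM / threadsPerBlock;
        int byRegisters = registersPerSM / (registersPerThread * threadsPerBlock);
        int bySharedMem = sharedMemPerBlock > 0 ? sharedMemPerSM / sharedMemPerBlock
                                                : maxBlocksPerSM;

        int blocks = std::min(std::min(byThreads, byRegisters),
                              std::min(bySharedMem, maxBlocksPerSM));
        return static_cast<float>(blocks * threadsPerBlock) / maxThreadsPerSM;
    }

    // Example: 256 threads per block, 16 registers per thread and 4kB shared memory
    // allow 2 concurrent blocks, i.e. an occupancy of 512/768 = 0.67.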

Branching is handled by serializing the different execution paths. This

can result in inefficient execution. However for small branches the com-

piler is able to create predicated instructions - a very efficient way to avoid

branching. Instructions are predicated by conditions - if the condition is

true the instruction is executed; NOP is executed if the condition is not

true. Mahlke et al. [33] contains a comprehensive discussion of instruction

predication.

3.3.4 Memory Model and Access

Registers are assigned to blocks and can not be shared between them and

each thread in a block only accesses registers assigned to itself. Shared

memory is also assigned to blocks and is only accessible by that block. Due to

the parallel nature of the architecture, efficient simultaneous access to shared

memory is possible as shared memory on each streaming multiprocessor is

divided into 16 memory banks. The programmer has to avoid bank conflicts

- simultaneous access to the same memory bank from different threads of

one half warp. Bank conflicts can be circumvented by aligning memory

fields and devising a reasonable memory access pattern. Constant memory

is a very efficient way to access values that are common for all threads in a

block, for example mathematical constants or other kernel parameters.

Global memory access is very slow as it is basically uncached DRAM.

However memory reads by consecutive threads in a warp can be combined by

the hardware into several wide memory reads which are a lot faster than ran-

dom reads. The requirement is that the threads in the warp must be reading

memory in order. More precisely a thread number N in a half warp should

access address HalfWarpBaseAddress + N and HalfWarpBaseAddress should be of


type type ∗ with sizeof (type) equal to 4, 8 or 16. HalfWarpBaseAddress should

be aligned to 16 ∗ sizeof(type) bytes. The same applies to memory writes. An

efficient programming pattern is therefore to fetch the data a thread block

needs to operate on into shared memory in a consecutive manner, then op-

erate on that data and finally write it back in the efficient way described

before. Shared memory can thus be seen as a programmable shared cache.
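A CUDA sketch of this staging pattern for a simple element-wise operation is shown below; the kernel is purely illustrative and not part of the reconstruction code. One block stages a contiguous tile of 256 values in shared memory, operates on it and writes it back.

    __global__ void scaleStaged(float* data, float factor)
    {
        __shared__ float tile[256];               // programmable cache for one block

        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[idx];            // coalesced read: thread N loads element N
        __syncthreads();                          // only required if threads later read
                                                  // values staged by other threads

        tile[threadIdx.x] *= factor;              // operate on the staged data

        data[idx] = tile[threadIdx.x];            // coalesced write back
    }

    // Launched with 256 threads per block and n/256 blocks for an array of n floats:
    // scaleStaged<<<n / 256, 256>>>(devicePtr, 0.5f);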

Newer CUDA devices with compute capability 1.2 or higher do not have

those restrictions. These devices are able to perform coalesced memory

access as soon as threads from the same half-warp access memory within

one segment of global memory. One segment can be up to 128 bytes wide.

The access pattern however does not influence whether or not memory reads

are coalesced.

The hardware background to this optimization is based on the bus width

of the memory subsystem of the GPU. It is 384bit wide for first generation
CUDA devices and up to 512bit wide in the newest models. Fetching a

32bit floating point value takes a certain time, usually 400-600 GPU cycles,

however fetching 12 32bit floating point values (384bit) that are located

next to each other in global memory takes the same amount of time. The

memory subsystem always fetches the full bus width and discards the data

that was not requested. By following aforementioned alignment rules one

can always exploit the full potential of the memory bus and thus speed up

memory access.

3.3.5 CUDA Toolchain

The CUDA toolchain allows developers to write code for the graphics card

in C with some additional CUDA specific syntax. Kernels can be started

from the host program with function calls. The CUDA compiler automati-

cally adds functions to upload kernel instructions to the GPU. The CUDA

framework provides C functions to allocate memory on the GPU and to

copy data from and to the GPU. The toolchain integrates nicely into Visual

Studio by Microsoft. Additionally the framework contains FFT and BLAS

libraries.

The CUDA framework allows developers to mix conventional C/C++

code running on the CPU with CUDA specific kernels and function calls

that operate on the GPU. To accomplish this, the toolchain detects CUDA

specific code, extracts it from the source and compiles it with a proprietary


compiler. Conventional C/C++ code is passed to a user defined compiler

(Micorosft Visual C MSVC, Intel ICC or GCC). The CUDA kernels are then

injected as load images into the object files along with routines to upload

them to the GPU. In the linking step additional runtime libraries are added

to support the aforementioned C functions to allocate memory and upload

data and to start kernels. When compiling kernels, PTX code is generated.

From this intermediate representation the compiler generates device specific code
which resembles the load images injected into the object files. The abstract

PTX code is generated as an intermediate step because devices differ in com-

pute capability, for example newer GPUs support double precision floating

point computation or atomic operations.

3.3.6 Debugging CUDA code

CUDA provides an emulation mode to debug kernels. After recompiling

kernels with the parameter deviceemu set, instead of running on the GPU

the kernel is executed on the CPU. All threads that would normally run

in parallel on the graphics card run on the host system sequentially. This

enables the developer to set breakpoints, examine variables and read out

memory. Even output to the screen is possible from inside the kernel. A

side effect of this method is that computation is not actually performed on

the GPU so the emulation mode is not useful for locating errors, for example

examining the differences in floating point calculations. The threads run in

succession, which might hide race condition errors in the code or other

concurrency related problems.

Nvidia also provides a port of the GNU Project Debugger GDB for

CUDA code. This tool allows realtime debugging of code running on the

graphics card. This is useful to debug a kernel without side effects caused by

emulation. It allows developers to stop execution at any line in the kernel

code, step through kernels, read out current device memory and switch

between blocks as well as threads.

In some cases it might also be necessary to examine what instructions

the compiler generates from the C code to identify bottlenecks and detect

causes for errors. For this purpose the third party tool decuda by Wladimir

J. van der Laan is available [51]. It is a disassembler for the generated kernel

binary files. It allows developers to not only examine the intermediate PTX

code but also the actual code that is executed by the GPU.


3.4 Implementing projectors in CUDA

The following section describes how the algorithms for forward- and backward-

projection are implemented in CUDA to run on a CUDA capable graphics

card.

3.4.1 Requirements

The hardware running the projection algorithm has to meet certain require-

ments for the projectors to produce the same numerical results as the CPU.

Furthermore there are basic requirements the CUDA hardware has to fulfill

to match the efficiency of the CPU algorithm.

The first question is whether or not graphics devices have enough mem-

ory capacity to store the data structures that are required for reconstruc-

tion. The minimal datasets required for the projection of one sinogram

view are the sinogram view itself and two images, one for input or out-

put and the other to store intermediate calculations, more precisely the

sheared image. Assuming 32bit floating point values for each voxel and

pixel and a typical scanner geometry of 55 crystal rings with 7 segments,

span 11 and a maximum ring difference of 38 one sinogram view with 559z

and 336ρ samples is 0.72MB large and the 2 images would be together

2∗366ρ∗366ρ∗109z = 111.4MB large. For span 1 reconstruction a sinogram

could have up to 2769z samples yielding 3.6MB per sinogram view. This

is required for some PSF reconstruction algorithms [44]. Modern graphics

cards have memory banks multiple times larger than these requirements. Ta-

ble 3.2 shows memory specifications for a number of selected Nvidia graphics

cards from the last three GPU chip generations and proofs that a GPU recon-

struction algorithms would not be constricted by available memory (sources:

[55], [36], [37]).

Device                Memory     Memory Interface    Memory Bandwidth
GeForce 8800 GTX      768MB      384bit              86.4GB/s
GeForce 8800 Ultra    768MB      384bit              103.7GB/s
Tesla C870            1536MB     384bit              77GB/s
GeForce 9800 GX2      2x500MB    512bit              2x64GB/s
GeForce GTX 295       2x896MB    2x448bit            2x111.9GB/s

Table 3.2: Selected Nvidia cards memory specification


The second important requirement is the numerical reproduction of the

CPU projector results on the GPU. Therefore it has to be determined if the

device architecture supports floating point operations similar to the CPU

and in particular if it supports the IEEE 754 floating point standard. The

algorithms operate on single precision floating point numbers. Therefore

only single precision operations have to be validated.

According to [41] accumulation and multiplication operations are stan-

dard compliant. However the compiler might merge ADD and MUL to a

combined multiply-add (MAD) instruction. This instruction truncates the
intermediate result of the multiplication, which might lead to inaccurate results. But the

compiler can be forced to generate separate multiply and add instructions.

Division is specified to have a maximum error of 2 ulp. Trigonometric
functions sin and cos are specified with maximum errors of 2 ulp. These

variances lie within the boundaries of the acceptable error.
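A hedged sketch of how the merge into MAD can be enforced to stay separate: the CUDA single precision intrinsics __fmul_rn() and __fadd_rn() perform round-to-nearest multiply and add operations that the compiler does not contract into a MAD instruction (the function and operand names below are illustrative):

// Illustrative device function; a, b and c are arbitrary operands.
__device__ float madFree(float a, float b, float c)
{
    /* a * b + c may be contracted by the compiler into a single MAD
       instruction that truncates the intermediate product. The intrinsics
       below force two separate IEEE 754 round-to-nearest operations. */
    return __fadd_rn(__fmul_rn(a, b), c);
}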

To conclude the basic requirements analysis it is determined that it is

possible to port the CPU projector algorithm to the graphics device as both

computational accuracy and available memory are adequate.

3.4.2 Implementation

The projector implementation is based on the optimization principle from

listing 3.3 on page 56. In the example the parallel capabilities of the GPU

are exploited to simultaneously perform operations on a large part of an

array. This parallelism can be used in the projection algorithm because its

basic operation is manipulating large arrays. This is discussed in section 3.1.

So the algorithm is implemented in a way that allows an optimization of the

calculations by parallelization. This idea is illustrated in figure 3.3 for the

backprojector: during one view projection the algorithm iterates multiple

times over an image and manipulates it voxel by voxel. This subsection

describes in detail how the projector algorithm is implemented in CUDA.

The forward-projector can be ported by parallelizing the outer loops

of the algorithm and assigning them to thread-blocks and threads. Figure

3.7(a) shows the first part of the forward projector algorithm, the calculation

of the segment 0 projection and the calculation of the sheared image. The

sinogram and one y-slice of the image is shown. The algorithm has to iterate

over the entire image volume and calculate the linear interpolation between

two neighboring voxels on the x axis. The figure shows the interpolation in


x direction with the line integral in y direction. Two arrows converging in

one pixel of the sinogram symbolize one interpolation. The volume indices

are accessed by the different parallelization elements. Along the z -axis all

threads from one thread-block are allocated. The x axis is handled by

different blocks. The threads calculate the interpolation for one y image

slice, save it to the sheared image and add it to the correct sinogram bin.

They loop over the y axis and perform the same calculation for each image

slice. In effect an entire image slice is processed simultaneously. So for each

voxel in the x/z plane there exists a thread divided in blocks along the x

axis and each thread iterates over the y axis. For interpolation along the y

axis with the line integral in x direction the process is the same apart from

exchanging the x and y dimensions.
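The following hypothetical skeleton illustrates this mapping (the constant names follow the listings later in this section; everything else, including the interpolation weight and the sinogram bin index, is invented, and bounds handling is omitted): one block per x value, one thread per z sample, and a loop over y inside each thread.

#define ZSAMPLES   120    /* image z dimension (example value, padded)   */
#define RHOSAMPLES 336    /* transaxial image dimension (example value)  */

/* Purely illustrative sketch of the fwd-x style mapping; not the actual
   projector kernel. */
__global__ void forwardSegment0Sketch(const float *img, float *shear,
                                      float *sino, float coeff,
                                      unsigned int ysamples)
{
    const unsigned int x = blockIdx.x;    /* blocks cover the x axis  */
    const unsigned int z = threadIdx.x;   /* threads cover the z axis */
    float sum = 0.0f;                     /* line integral along y    */

    for (unsigned int y = 0; y < ysamples; y++)
    {
        /* linear interpolation between two neighbouring voxels on the x axis */
        float v0 = img[z + x * ZSAMPLES + y * ZSAMPLES * RHOSAMPLES];
        float v1 = img[z + (x + 1) * ZSAMPLES + y * ZSAMPLES * RHOSAMPLES];
        float interp = v0 + (v1 - v0) * coeff;

        shear[z + x * ZSAMPLES + y * ZSAMPLES * RHOSAMPLES] = interp;  /* sheared image */
        sum += interp;
    }
    sino[z + x * ZSAMPLES] += sum;   /* one sinogram bin per thread, no write conflict */
}

A matching launch would start RHOSAMPLES blocks of ZSAMPLES threads, one per (x, z) pair; the fwd-y variant exchanges the roles of x and y.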

Figure 3.7: CUDA implementation of the forward projector. (a) Calculation of segment 0 and the sheared image; (b) projection of segments ≠ 0.

The projection of segments ≠ 0 is shown in figure 3.7(b). Here the

oblique angles are calculated which are always independent of θ or the di-

rection of the line integral. It is an interpolation in z direction calculated

from the sheared image, which is the image that already has incorporated

the rotation θ around the axial axis of the scanner. The interpolation has to

be calculated multiple times with different oblique angles for each segment

≠ 0. Thus one thread loops over y as well as all the oblique segments. For

each iteration over the oblique segments the data is the same however, the

angle and thus the interpolation coefficients change. Due to the angle, the

number of interpolations per sinogram segment is different. Each thread

adds the interpolated values to the correct sinogram pixel in the correct


sinogram segment.

The backward projector is the inverse operation of the forward projec-

tor. First the backward projector has to calculate the sheared image from

the sinogram, and then the image is unsheared. The calculation of the

sheared image during backward-projection is identical to the calculation of

the oblique sinograms during forward projection. During calculation of the

sheared image the z axis corresponds to the threads, the x axis separates

the blocks from each other and each thread loops over y and the number of

segments. Each thread calculates one voxel of the sheared image.

Figure 3.8: Unshearing of the image during backprojection. (a) Unshearing of the image, step 1; (b) unshearing of the image, step 2.

The unshearing of the image is split up into two separate operations.

Depending on the direction of the line integral the interpolation has to be

calculated in x or y direction. The CPU algorithm iterates over the sheared

image, calculates the interpolation and adds the two interpolation results

to the unsheared image. Parallelizing this operation is not straightforward.

Assuming that there is one thread for each element of an image slice, for

each iteration over the direction of the line integral the thread would write

to two voxels in the output image. A thread with the same thread ID in

a neighboring thread block would also write to two voxels in the output

image but due to the nature of interpolation they would both try to write

to the same voxel. Two different threads from two different blocks trying to

modify the same memory results in undefined behavior if the read-modify-

write operation is not atomic. There is no way to synchronize threads from


different thread blocks and atomic operations are not available on all CUDA

capable devices and are very costly due to serialization of the commands.

A way to work around this issue is illustrated in figure 3.8. The kernel is

split up into two separate functions, the first function calculating the first

part and the second function calculating the second part of the interpolation.

By doing so writing to the same memory location at the same time is not

possible, as each thread only writes to one single image voxel.
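To make the hazard concrete, the following purely illustrative kernels (not taken from the projector; the target index array is invented) contrast the racy scattered accumulation with an atomic variant, which serializes colliding threads and is not supported for floating point values on all devices of this generation; the projector therefore splits the work into two kernels instead.

/* Illustrative only: if two threads, possibly in different blocks, compute a
   contribution for the same output voxel, this read-modify-write is undefined. */
__global__ void scatterRacy(float *out, const float *in, const int *target, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[target[i]] += in[i];            /* lost updates if target[i] collides */
}

/* atomicAdd() makes the update safe, but serializes colliding threads and
   requires a device that supports floating point atomics. */
__global__ void scatterAtomic(float *out, const float *in, const int *target, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&out[target[i]], in[i]);
}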

The particular arrangement of threads, blocks and loops to the image

coordinates is sensible because the image is indexed in the order zxy . Map-

ping the z coordinate to threads allows them to consecutively access mem-

ory. This increases the performance of the kernel as the hardware is able

to combine consecutive memory access of threads in a warp to one single

wide memory access that is a lot faster (compare section 3.3.4). The blocks

are mapped to the x or y coordinate depending on the direction of the line

integral or in the case of oblique segment calculation the blocks always rep-

resent the y coordinate. The remaining coordinate and the segments are

handled by loops inside a thread.

Each of the steps described above is implemented as a separate kernel

function. Additionally there are separate kernels for the different line inte-

gral directions along the x or y axis. So for one projection method there are

9 kernel functions required including the Joseph’s scaling kernel. Table 3.3

shows a brief overview of all required kernels and their function.

Name             Function
fwd-x            Project segment 0 and sheared image (y direction)
fwd-y            Project segment 0 and sheared image (x direction)
fwd-ob           Project segments ≠ 0
bwd-ob           Project all segments into sheared image
bwd-x1           Unshearing, first part of interpolation (x direction)
bwd-x2           Unshearing, second part of interpolation (x direction)
bwd-y1           Unshearing, first part of interpolation (y direction)
bwd-y2           Unshearing, second part of interpolation (y direction)
joseph-scaling   Joseph's method scaling

Table 3.3: Projector kernel functions

The main reason for splitting up the kernels like this is synchronization.

The only synchronization possible during kernel execution is a call to a

special kernel function __syncthreads() that synchronizes all threads of one block.


cudaMalloc((void **) &image, sizeof(float) * image_size);
cudaMemset(image, 0, sizeof(float) * image_size);
cudaMemcpy(image, cpu_image, sizeof(float) * image_size,
           cudaMemcpyHostToDevice);
cudaMemcpy(cpu_image, image, sizeof(float) * image_size,
           cudaMemcpyDeviceToHost);
cudaFree(image);

Listing 3.4: CUDA memory operations

Global synchronization, however, is not possible within the runtime

of one kernel. Therefore the logical parts of the projector that require a

previous operation to be finished entirely before the next operation can be

started are split up into separate functions.

Due to the large amounts of memory available on CUDA devices it is

possible to transfer large chunks of data in one single transfer to the device

and then calculate multiple projections. In particular, devices can hold all

the data required for the projection of one entire subset, which consists

of a series of sinogram views (the number of views depends on how many

subsets are used), an image and a sheared image. Example calls to allocate,

set, copy and free memory are given in listing 3.4.

The image pointer is modified by the cudaMalloc call to point to mem-

ory on the GPU and is not valid for CPU operations. The memory set

and memory copy functions use this pointer to identify a GPU memory ad-

dress. The last parameter of cudaMemcpy specifies whether data should be

downloaded or uploaded from or to the device. cudaFree finally makes the

memory available again for further allocations.

3.4.3 Optimization

To make full use of the parallel capabilities of CUDA devices a number of

optimization techniques are applied. By optimizing the CUDA kernels the

GPU reconstruction is no longer only a port from CPU to GPU but is a

new implementation of the same algorithm.

One straightforward idea to optimize the algorithm is register reduction.

It can speed up the kernel execution for two reasons. The first reason is oc-

cupancy. The more registers a kernel uses, the fewer blocks can run simultane-

ously on one streaming multiprocessor as the blocks on one multiprocessor


/* register implementation */
float k = d_img[z + x * ZSAMPLES +
                iyy * RHOSAMPLES * ZSAMPLES];
float h = d_img[z + x * ZSAMPLES +
                (iyy + 1) * RHOSAMPLES * ZSAMPLES];
a = k + (h - k) * pc;

/* shared memory implementation */
extern __shared__ float shared[];   /* size defined on the CPU at kernel launch */
const unsigned int z = threadIdx.x;
shared[z] = d_img[z + x * ZSAMPLES +
                  iyy * RHOSAMPLES * ZSAMPLES];
shared[z + ZSAMPLES] = d_img[z + x * ZSAMPLES +
                             (iyy + 1) * RHOSAMPLES * ZSAMPLES];
a = shared[z] + (shared[z + ZSAMPLES] - shared[z]) * pc;

Listing 3.5: Register and shared memory implementation of interpolation

share all its resources. Simultaneous block execution however is used to hide

memory latencies when fetching data from global memory. If not enough

blocks are active the memory latencies can not be hidden and the perfor-

mance of the kernel decreases. In extreme cases where a lot of registers are

required, the compiler may generate instructions to offload registers to local

memory. This is known as register spilling. The performance of the kernel

suffers drastically if registers are spilled to local memory as it is unbuffered

off-chip DRAM. To reduce the number of required registers two techniques

are applied in the projection kernels. The first idea is to replace registers

by shared memory. This however depends on the shared memory usage, the

register usage and the occupancy of a kernel. Occupancy is a function of

how many resources a kernel uses. The resources are registers and shared

memory. If a kernel uses either too many registers or too much of the avail-

able shared memory the occupancy goes down. The projectors generally use

a lot of registers - up to 12 per thread and not so much shared memory - only

up to 2kB of the existing 16kB. Therefore some of the registers are moved

to shared memory to even the resource consumption of the kernels. Listing

3.5 shows two implementations of an interpolation: one using registers and

the other using shared memory to store intermediate results.

Using the shared memory implementation the compiler can reduce the

amount of required registers from 12 to 10 resulting in 100% occupancy as

opposed to 83% when using 12 registers. Another method to reduce the


/* unsigned int constant memory defines */
#define RHOSAMPLES 0
#define ZSAMPLES   1
#define K_NUMUI    25

/* define array with implicit allocation on GPU */
__constant__ unsigned int ui[K_NUMUI];

/* define parameters on CPU and copy to GPU */
kpui = (unsigned int *)
    malloc(sizeof(unsigned int) * K_NUMUI);
kpui[RHOSAMPLES] = RhoSamples;
kpui[ZSAMPLES] = ZSamples;
cudaMemcpyToSymbol(ui, kpui,
    K_NUMUI * sizeof(unsigned int));

/* use in kernel on GPU */
s = shear[x * ui[ZSAMPLES] + y * ui[RHOSAMPLES]];

Listing 3.6: Usage of constant memory to reduce register usage

number of registers is constant memory. It is a 64kB large part of global

memory and each multiprocessor has 8kB dedicated cache available. Only

read operations are possible when using constant memory, however read ac-

cess on a cache hit is as fast as register access. Therefore it is used to make

constants and parameters available to the kernel. The projector kernels use

two arrays of constant memory, one storing floating point constants, the

other one containing integer constants. The constants are used by access-

ing elements of the array via predefined indices. Listing 3.6 illustrates this

principle by making the image dimensions available to kernels via constant

memory. Global constants are defined to access elements of the array, the

memory on the GPU is allocated using the __constant__ keyword, the pa-

rameters are specified on the host and are copied to the GPU where they

can be used in the kernel. A more elegant method would be to use one

single structure, however as of CUDA version 1.1 the compiler has problems

generating correct code when structures are stored in constant memory due

to the variable lengths of the members.

The second optimization principle is related to memory. Memory access

is by nature very costly and often constrains the performance of memory in-

tensive algorithms. Memory access on modern graphics cards takes between

400 and 600 cycles compared to 1 cycle for register access. Therefore it is


important to reduce memory reads and writes by optimizing the code to use

caches and to reduce memory access in general. This can often be achieved

by bundling many small memory accesses into one large bulk access. An-

other technique is to hide memory latency by rearranging instructions or

executing threads in parallel. While some threads wait for a memory fetch

to finish, other threads might use the GPU to calculate results thus not

wasting any GPU time while waiting.

GPUs support all those optimization techniques. Some are built in hard-

wired parts of the device architecture. For example latency hiding by ex-

ecuting other threads while others are blocking on memory is part of the

GPU's integrated functionality. The thread execution manager takes care of

the optimal arrangement of threads and memory access. All the program-

mer has to do in this matter is to start enough threads so that there are

threads ready to run while others wait. The metric to determine if enough

threads are started is occupancy.

Section 3.3.4 describes a way to access memory in a way so that the

hardware can optimize memory access by fetching large chunks of memory at

once instead of accessing only small amounts with multiple separate memory

round trips. This is an important optimization technique and can drastically

speed up kernel execution. Even if the threads of one block do not have

a memory access pattern that can be optimized, with the help of shared

memory this can be circumvented. The idea is that all threads in one block

copy the data they operate on to shared memory before starting any cal-

culations. The threads can work together to fetch the data in a coalesced

manner that can be optimized by the hardware. After this the threads

synchronize to ensure that all threads have finished fetching data from the

memory and the shared memory is populated with the required data by call-

ing syncthreads(). They can then start their calculations using the data

in shared memory. Shared memory access can be up to 100 times faster

than uncoalesced global memory access [41]. After the threads finish their

calculations the same principle can be applied to write back data. Threads

write their intermediate and final results to shared memory and after every

thread is finished calculating they issue coalesced memory writes.

Listing 3.7 shows a typical code section taken from the fwd-ob kernel

function that populates shared memory before calculation. Each thread

fetches one image voxel from the sheared image; one thread-block fetches


const unsigned int z = threadIdx.x;
__syncthreads();
shared[z] = shear[z + x * ZSAMPLES +
                  y * ZSAMPLES * RHOSAMPLES];
__syncthreads();

Listing 3.7: Populating CUDA shared memory

an entire z row. Each thread can now calculate the interpolation (compare

Figure 3.8(b)) along the z axis using the values from shared memory. The

results are also stored in shared memory and written to the sinogram in a

similar fashion after all calculations are finished.

Data transfers between host system and graphics device result in over-

head that is unique to CUDA algorithms. To speed up these transfers the

CUDA API offers special function calls, to allocate and use so called pinned

memory. Pinned memory is page locked memory that can not be moved

or paged out from RAM. The CUDA driver is able to track page locked

memory as its address never changes. If memory transfers are initiated be-

tween pinned memory and the CUDA device the driver is able to perform

direct memory access (DMA) copies between the device and the memory.

On memory transfers between regularly allocated memory and the CUDA

device the memory is copied to small CUDA private pinned memory buffers.

This involves additional CPU time and an extra copy operation which de-

grades performance. Listing 3.8 shows the function that wraps the CUDA

call to allocate pinned memory inside a C function.

In some cases using pinned memory can worsen overall system efficiency

because it lessens the amount of available memory. However during PET

reconstruction all data structures have to fit into user memory anyway to

avoid swapping and guarantee optimal performance. Therefore the recon-

struction systems are equipped with enough memory so that CUDA pinned

memory can be used without any drawbacks.
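As a hedged usage sketch (the buffer name and size are illustrative), the wrapper from listing 3.8 can be combined with the regular copy functions; pinned memory is later released with cudaFreeHost() rather than free():

/* Prototype of the wrapper from listing 3.8 */
extern float *CUDA_mallocPinned(unsigned long int size);

void exampleTransfer(void)
{
    unsigned long int n = 336UL * 336UL * 109UL;   /* one image volume (example) */
    float *host_img = CUDA_mallocPinned(n * sizeof(float));
    float *dev_img;

    cudaMalloc((void **) &dev_img, n * sizeof(float));

    /* ... fill host_img with the image to be projected ... */

    /* transfers between pinned host memory and the device can use DMA */
    cudaMemcpy(dev_img, host_img, n * sizeof(float), cudaMemcpyHostToDevice);
    /* ... run projection kernels ... */
    cudaMemcpy(host_img, dev_img, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_img);
    cudaFreeHost(host_img);   /* counterpart of cudaMallocHost(), not free() */
}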

3.4.4 Results

First tests are done using a workstation test system with two dual core Intel
Xeon processors clocked at 2660MHz, 8GB of RAM, the Intel S5000PSL
server mainboard (8x PCIe) and a Geforce 8800 Ultra clocked at 1512MHz.

The reconstruction is calculated using the unweighted OSEM algorithm.


extern "C" float *
CUDA_mallocPinned(unsigned long int size)
{
    float *t;
    cudaMallocHost((void **) &t, size);
    return t;
}

Listing 3.8: Wrapper function for pinned memory allocation

Three iterations with 21 subsets are calculated. The sinogram dimensions

are 336x336x559, the final image dimensions are 336x336x109.

The first results are encouraging. Figure 3.9 shows the runtime of the

reconstruction using different setups and optimization. Table 3.4 describes

the meaning of the reconstruction setups and shows the speedup

$$S_p = \frac{T_{\mathrm{previous}}}{T_{\mathrm{current}}} \qquad (3.10)$$

of each configuration compared to the previous one. The CPU setups im-

plement the same algorithm as the existing validated software. The ”GPU”

setup is a naive implementation with no optimizations applied. It is imple-

mented to prove the feasibility and as a reference point to benchmark the

optimizations used. In the ”GPU: opt1” setup the same optimization that is

used on the CPU is used for the GPU: the index is reshuffled from xyz to zxy

for faster continuous data access. The ”GPU: opt2” setup contains the opt1

optimization and utilizes the pinned memory functions of the CUDA API as

described above. In the ”GPU: opt3” in addition to the opt2 optimizations

all kernels are modified to use coalesced memory accesses where possible.

The GPU reconstruction with best optimizations applied takes only 48% of

the time of the validated reconstruction algorithm used in current produc-

tion systems.

Figure 3.10 shows the absolute error sum, the average error and the

maximum error when comparing the reconstructed image to a reference

image reconstructed using existing validated software. The scale of the y-

axis is logarithmic. It shows a small error for the two CPU implementations

with an error sum of 0.00015. The error of the GPU implementations is
higher yet constant for all setups. The error sum is 0.05, with an average
of $5.99 \times 10^{-9}$ and a maximum error of $3.71 \times 10^{-6}$. This is well within
acceptable boundaries.


Figure 3.9: Runtime comparison of different setups (runtime in s per setup for backprojection, forwardprojection and overall reconstruction time)

Setup            Description                                          Sp
CPU: 1 thread    one CPU thread (one core) is used                    -
CPU: 4 threads   4 CPU threads (all four cores) are used              2.14
GPU              GPU with simple implementation (no optimization)     1.19
GPU: opt1        GPU with optimizations: index reshuffle xyz to zxy   1.33
GPU: opt2        GPU with optimizations: opt1 and pinned memory       1.11
GPU: opt3        GPU with optimizations: opt2 and coalesced access    1.19

Table 3.4: Setup description and speedup


3.4.5 Other algorithms

Apart from the projectors themselves other kernels are created to support

the projector kernels. As mentioned earlier coalesced memory access is

very important. To meet the requirements for coalesced memory access

of a two-dimensional array as it is used in the kernels with for example

width * blockIdx.y + threadIdx.x, the width has to be a multiple of 16 [41]. The

width of one dataset however is typically 120 (109 padded to 120). So a

preprocessing step is added to the projectors that pads the dataset width
from 120 to 128 and after projection the padding is removed.

Figure 3.10: Absolute error comparison of different setups (error sum, error average and error maximum per setup, logarithmic scale)

Figure 3.11

shows the difference between a padded and an unpadded array in a memory
region 128 elements wide.

Figure 3.11: Unpadded and padded memory

In the upper part of the image the unpadded array is displayed. The

width of the array does not match the width of the memory, it is 8 elements

too short. As a result the second dimension of the array always starts

at different memory addresses. For the CUDA device to access this data

efficiently the x dimension always has to start at the first address of a mem-

ory block as illustrated in the lower part of the image. At the end of each

z dimension 8 empty elements are added to pad out the array. Listing 3.9

shows a kernel that can pad and unpad a three-dimensional image volume.


__global__ void padunpadVolume(float *in,
    float *out, unsigned short int inDim,
    unsigned short int outDim,
    unsigned short int xy)
{
    const unsigned int x = blockIdx.x;
    const unsigned int z = threadIdx.x;

    for (unsigned short int y = 0; y < xy; y++)
        out[z + y * outDim + x * outDim * xy] =
            in[z + y * inDim + x * inDim * xy];
}

Listing 3.9: Kernel to pad and unpad 3 dimensional arrays on the GPU

Input and output arrays are passed to the kernel as well as their respec-

tive dimensions. The kernel then pads or unpads the input array depending

on inDim and outDim by iterating over the input and output volumes using

blocks for the x dimension and threads for the z dimension.
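A possible launch of this kernel, sketched under the assumption that xy denotes the y dimension (following the loop bound in listing 3.9) and that the padded buffer was zeroed beforehand; the pointer names and image dimensions are example values:

/* Assumes the padunpadVolume kernel from listing 3.9 is in scope. */
void padForProjection(float *d_img, float *d_imgPadded)
{
    const int xDim = 336;
    const int yDim = 336;

    /* pad: one block per x value, one thread per valid z sample (120),
       so no thread reads outside the unpadded source volume */
    padunpadVolume<<<xDim, 120>>>(d_img, d_imgPadded, 120, 128, yDim);

    /* ... projection kernels operate on the padded volume ... */

    /* unpad: same mapping with the two widths swapped */
    padunpadVolume<<<xDim, 120>>>(d_imgPadded, d_img, 128, 120, yDim);
}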

Another additional algorithm is implemented in a kernel for index reshuf-

fling. It efficiently reorders the indices from xyz to zxy , the required index

order for the projection kernels. The reshuffling implementation is compa-

rable to the CUDA Matrix Transpose SDK Example [35] where the indices

of a two dimensional array are exchanged. The implementation is based on

the idea of using shared memory as a buffer and reading and writing data

as efficiently as possible from and to memory. The algorithm is illustrated

in Figure 3.12. It shows one half of a 4x4x4 volume on the left side indexed

by xyz , in the middle the intermediate buffer or shared memory and on the

right side the reshuffled image.

Data is accessed as coalesced as possible in global memory and read

and written scattered to and from shared memory as the access to global

memory has to follow strict rules to be efficient, whereas shared memory

operations can be scattered and are still faster. All blocks in one thread

execute coalesced reads in x direction, write the data to shared memory and

then loop to their next z dimension where again each thread reads data and

writes it to shared memory. After each thread filled up its part of the shared

memory which is ensured by a call to syncthreads (); the threads start to

write back data to the new array again in a coalesced manner. They scatter

their access to shared memory so that they can execute fully coalesced write


Figure 3.12: Index reshuffling on CUDA device

operations to global memory.
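The following is a simplified sketch of the idea, written under stated assumptions (the input stores x fastest, the output stores z fastest, and all dimensions are multiples of the tile size); the actual kernel organizes its loops differently, as described above:

#define TILE 16

/* Simplified sketch of an xyz to zxy reorder staged through shared memory
   in the spirit of the SDK matrix transpose: it transposes x/z tiles for
   one y slice at a time. Not the actual reshuffling kernel. */
__global__ void reshuffleXYZtoZXY(const float *in, float *out,
                                  int nx, int ny, int nz)
{
    __shared__ float tile[TILE][TILE + 1];   /* +1 column avoids bank conflicts */

    const int x  = blockIdx.x * TILE + threadIdx.x;   /* coalesced reads along x  */
    const int z  = blockIdx.y * TILE + threadIdx.y;
    const int zo = blockIdx.y * TILE + threadIdx.x;   /* coalesced writes along z */
    const int xo = blockIdx.x * TILE + threadIdx.y;

    for (int y = 0; y < ny; y++)
    {
        tile[threadIdx.y][threadIdx.x] = in[x + y * nx + z * nx * ny];
        __syncthreads();

        out[zo + xo * nz + y * nz * nx] = tile[threadIdx.x][threadIdx.y];
        __syncthreads();   /* the tile is reused in the next y iteration */
    }
}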

3.4.6 Texture Units

The texture units are an integral part of every graphics processing unit. On

modern G80 chips they support implicit interpolation when accessing data

in 2D and 3D textures with different interpolation methods. In conjunction

with texture caches they provide a very effective way to access memory and

have interpolation calculated implicitly. These powerful resources would be

ideal for the implementation of projectors but are not exploited in this work

for mainly two reasons. First, one of the main goals of this work is to create

a proof of concept that reconstruction algorithm implementations can be

ported to the GPU and that both CPU and GPU implementations calcu-

late the same results. Therefore a more ”literal” port from CPU to GPU is

preferred over implementing an entirely new algorithm. The implicit inter-

polation of the texture units is largely out of the control of the programmer

and can thus not be tuned easily to produce the same results as the CPU in-

terpolation algorithm. Second, at the time these algorithms are implemented

there is no 3D texturing support for CUDA. Now with CUDA version 2.2

3D textures are supported. It is therefore suggested as one of the next steps

of this work to prototype an implementation utilizing the texture units of

the GPU. It is likely that another significant speedup on top of the speedup

of the current GPU implementation might be achieved if CPU results can


be reproduced using the texture units.

3.5 Debugging and Validation

One crucial requirement of this work is to reproduce the results of the recon-

struction process used in PET products today. An in-depth validation of the

new implementations is therefore necessary. Also debugging methods and

tools are required to validate the reconstruction process step by step and to

quickly detect, locate and resolve errors. For these tasks a number of special

purpose utilities are created and used next to already existing products.

3.5.1 Debugging

For low level debugging the Visual Studio 2005 built in debugger is used in

conjunction with the CUDA emulation mode described in subsection 3.3.6.

For numeric evaluation of the reconstructed image the hex editor XVI32

[32] is used. It allows easy navigation through the raw image data and

automatically interprets the data as IEEE 754 single precision floating point.

Figure 3.13 shows the hex editor in action.

Figure 3.13: Hexeditor

Another debugging tool is an application called SinogramViewer. It is

specifically developed for the objective of this work and provides a number

of important features for debugging and development. The following is an

incomplete list of the features of the SinogramViewer tool.

• side by side comparison of sinograms and images with 3D axis selection

for images and angle selection for sinograms


• sinogram rotation and index reshuffling

• step by step control of reconstruction steps, instrumenting the re-

construction binaries for online side by side data comparison during

reconstruction

• support for multiple image and sinogram sizes

• animated image rotation (3D view)

• data comparison (sinogram and image)

• test data creation algorithms (sphere, cylinder, cube)

• C# CPU projectors

The tool is a C# .NET application that is developed and enhanced over

time. It starts out as a simple sinogram and image viewer, provides an envi-

ronment for first tests with projector algorithms and evolves into a powerful

debugging tool with many different features. Figure 3.14 shows the large im-

age and sinogram loading part of the application with 2 sinograms loaded.

The bottom slider allows scrolling through the different view angles of the

sinograms. Furthermore, a CPU projector can be started to backproject

a specified view angle of both sinograms. The difference between the two

sinograms can be calculated as well.

Another tool used is VINCI, a commercial product "designed for the

visualization and analysis of volume data generated by medical topographic

systems with special emphasis on the needs for brain imaging with Positron

Emission Tomography" [53]. It allows among other things the simultaneous

in depth examination of multiple images, plotting of profiles and image

arithmetic.

3.5.2 Validation

For validation a flexible test script and a command line tool for image com-

parison is created. The script allows an automated run of reconstruction

tools with various input data and parameters as well as the automated val-

idation of the reconstruction result. It is implemented as a Windows batch

script. Validation is done with the help of precalculated reference reconstruc-

tion results. These are created with the same input data and parameters


Figure 3.14: Sinogram Viewer Tool


SET VALBIN=rawdf
SET TESTDESC=(Brain, Large, UW-OSEM)
SET TESTCASENAME=BrainLUW
SET PARAMTEST=--algo uw-osem --is 3,28 --fl -l 73,. --force --gpu
SET PARAMGOLD=--algo uw-osem --is 3,28 --fl -l 73,. --force

REM delete existing files
del /Q %TESTOUTDIR%%TESTCASENAME%\*
del /Q %GOLDOUTDIR%%TESTCASENAME%\*

REM create files
%TESTBIN% -e %INDIR%%INFILE% --oi %TESTOUTDIR%%OUTFILE% %PARAMTEST%
%GOLDBIN% -e %INDIR%%INFILE% --oi %GOLDOUTDIR%%OUTFILE% %PARAMGOLD%

REM compare files
%VALBIN% %TESTOUTDIR%%OUTVOLUME% %GOLDOUTDIR%%OUTVOLUME%

Listing 3.10: Testscript excerpt

as the test case but the validated reconstruction process is used to create

them. The test is successful if the reference images can be reproduced by the

new reconstruction process. An alternative that is also possible is to provide

reference binaries. Those are proven binaries of a validated reconstruction

system. Both the test binaries and the reference binaries are executed with

the same parameters and input data. After both binaries finished recon-

struction the results are compared and analyzed. Listing 3.10 shows a small

excerpt of a test script that uses a reference and a test binary to reconstruct

a brain scan using the unweighted OSEM algorithm.

To compare and analyze the results a command line tool called rawdiff

is created to calculate the difference between two reconstructed images. It is

implemented as a C++ console application. It takes two paths and filenames

to the images as input and compares those two images. It can be called from

within the test script and its output is displayed on the screen or written to

a log file for automated test run protocols.

The tool calculates five different metrics to compare the two images and

thereby determines the accuracy of the result. The metrics are the absolute

error Eabs , the relative error Erel , the mean squared error Emse , the root


mean squared error Ermse and the normalized mean squared error Enmse .

The absolute error

$$E_{abs}(x, y, z) = \lambda(x, y, z) - \lambda_{gold}(x, y, z) \qquad (3.11)$$

is used to determine the overall accuracy of the calculated result compared

to the reference image. The comparison tool outputs the maximum absolute

error max (Eabs) and the mean absolute error

$$E_{mean\,abs}(x, y, z) = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \big(\lambda(x, y, z) - \lambda_{gold}(x, y, z)\big) \qquad (3.12)$$

The second metric is the relative error

$$E_{rel}(x, y, z) = \frac{\lambda(x, y, z) - \lambda_{gold}(x, y, z)}{\lambda_{gold}(x, y, z)} \qquad (3.13)$$

which is used to determine the number of decimal places that match. Again

the maximum max (Erel ) and the mean

$$E_{mean\,rel}(x, y, z) = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \frac{\lambda(x, y, z) - \lambda_{gold}(x, y, z)}{\lambda_{gold}(x, y, z)} \qquad (3.14)$$

relative errors are displayed by the comparison tool. The number of inaccu-

rate decimal places n is calculated by

$$n = \log_{10}\left(\frac{E_{rel}}{eps}\right) \qquad (3.15)$$

with $eps = 6 \times 10^{-8}$ for single precision floating point numbers accord-

ing to IEEE 754. Additionally the mean squared error and its normalized

form are used as proposed in [1]. The MSE is defined as

$$E_{mse}(x, y, z) = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \big(\lambda(x, y, z) - \lambda_{gold}(x, y, z)\big)^2 \qquad (3.16)$$

The root mean squared error as a good indicator of accuracy is often used to

compare the discrepancy between images that can diverge. It is calculated


rawdf .\recon.i ..\..\dev\data\image_00.v --abs
comparing .\recon.i with ..\..\dev\data\image_00.v
8339557 errors found;
abs sum: 0.0499405
abs av: 5.98839e-009
abs max: 3.70748e-006 at adr 12259472

Listing 3.11: rawdiff tool output

by

$$E_{rmse}(x, y, z) = \sqrt{E_{mse}(x, y, z)} \qquad (3.17)$$

The normalized MSE is calculated by first normalizing the images relative to
their intensities with $\lambda(x, y, z)_{norm} = \lambda(x, y, z)/\mu$, where $\mu$ is the intensity
of the image $\lambda$.

Listing 3.11 shows the example output for the comparison of a recon-

structed image. The absolute error sum, average and maximum is displayed

as well as the address of the maximum error.
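For illustration, a small C sketch of how such absolute error metrics can be accumulated over two raw float volumes is given below; it mirrors the kind of output rawdiff prints but is not the actual rawdiff source, and the struct and function names are invented:

#include <math.h>
#include <stddef.h>

/* Hypothetical summary of the absolute error metrics, similar to the values
   rawdiff prints (sum, average, maximum and location of the maximum). */
struct abs_error_stats {
    double sum;
    double average;
    double max;
    size_t max_index;
    size_t error_count;
};

struct abs_error_stats compare_volumes(const float *img, const float *gold, size_t n)
{
    struct abs_error_stats s = {0.0, 0.0, 0.0, 0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        double e = fabs((double) img[i] - (double) gold[i]);
        if (e > 0.0)
            s.error_count++;            /* voxels that differ at all */
        s.sum += e;
        if (e > s.max) {
            s.max = e;
            s.max_index = i;
        }
    }
    s.average = (n > 0) ? s.sum / (double) n : 0.0;
    return s;
}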

Different test cases include different input data sizes, different input data
content (e.g. wholebody scans, brain scans, mathematical data such as a
sphere) and different reconstruction parameters

and modes such as unweighted reconstruction, attenuation corrected recon-

struction, LOR space or PB space reconstruction.

Figure 3.15 shows four of the used test datasets. The first image is a

reconstruction of a neck scan from the nose to the center of the sternum.

The second image shows the scan of a phantom. A phantom is a device filled

with radiating substances that can be placed into PET systems. Depending

on the phantom type they simulate different regions of a patient’s body.

The third image is a full body patient scan and the bottom image shows a

reconstructed image of a uniform phantom. All images have the contrast

enhanced for better visibility and are reconstructed without attenuation

correction.

3.6 Product considerations

The GPU reconstruction algorithms developed are planned to be used in the

PET Reconstruction System - a headless workstation-like computer running


Figure 3.15: Testimages


Windows XP 64bit with 8GB of RAM and 2 multicore processors. The new

version of the reconstruction system should contain a CUDA capable graph-

ics device for GPU image reconstruction. A series of tests are devised to

determine which CUDA card should be used, whether or not the CUDA

device can be used as display adapter and for reconstructing PET sino-

grams at the same time and which mainboard should be selected for the

system. The tests can be categorized in three different groups: memory

tests, reconstruction tests and stability tests. The memory tests are used

to compare the CUDA device memory subsystems. Its performance is an

essential factor for image reconstruction runtimes. Additionally a number of

tests to benchmark the host to device data transfer rates are executed. The

reconstruction benchmarks are implemented to determine the performance

of the different system configurations running real world reconstruction al-

gorithms. The stability tests are done to determine how the systems behave

under constant load.

For all memory tests the data packet size was in the range from 10MB

to 200MB, starting from 10MB and gradually increasing in 10MB steps. For
the analysis only the timings for 10MB, 50MB, 100MB and 200MB transfer

sizes were taken under consideration. They represent different data packets

relevant to reconstruction like the full sinogram, the sinogram subset, the

image or a large image. For the reconstruction tests the algorithm was full

3D UW-OSEM. 3 iterations and 21 subsets were calculated. The dataset is

taken from a real brain scan, the sinogram dimensions are 336x336x559, the

final image dimensions are 336x336x109.

The mainboard choice is narrowed down by availability, price and com-

patibility with processors and RAM. The two mainboards to choose from

are the server board Intel S5000PSL [21] and the workstation board Intel

S5000XVN [22]. The list of CUDA capable cards available at that time for

the reconstruction system are the Geforce 8800 Ultra clocked at 1512MHz, a
special version of the Geforce 8800 Ultra clocked at 1663MHz, the Tesla C870,
a dedicated general purpose GPU without video output, and the Geforce

8800 GTX. In the cases where the tested CUDA device was not the display

adapter (always for the Tesla card), a Nvidia Quadro NVS 290 was used

for video output. Due to driver incompatibilities the card used as display

adapter has to be compatible to the card used for reconstruction so they

both have to be CUDA capable cards.


cudaEventCreate();        // create an event
cudaEventRecord();        // insert event into stream
cudaEventSynchronize();   // synchronize to event
cudaEventElapsedTime();   // calculate time difference
cudaEventDestroy();       // destroy an event

Listing 3.12: CUDA timing events

3.6.1 Timing the Tests

Timing CUDA operations whether they are kernel calls or other function

calls is not a trivial task. The problem is that after starting a kernel from the host

the function call immediately returns. Before CUDA version 1.0 one had to

force the calling context to wait for the CUDA function to return which it

did by busy waiting. With newer CUDA versions Nvidia introduced event

mechanisms. The entire program execution is seen as a stream into which

events can be inserted. Between events the elapsed time can be calculated

after synchronizing to the last event to make sure the stream execution

has reached it. Listing 3.12 lists the functions available for timing CUDA

executions.
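As a hedged usage sketch of these functions with their full signatures (the launch to be timed is left as a comment), timing one device operation looks roughly like this:

#include <cuda_runtime.h>

// Illustrative timing of one device operation with the CUDA event API;
// returns the elapsed time in milliseconds.
float timeDeviceWork(void)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);      // insert start event into stream 0
    /* launch the kernel(s) to be timed here, e.g. one projector kernel */
    cudaEventRecord(stop, 0);       // insert stop event after the launch

    cudaEventSynchronize(stop);     // wait until the stream has reached the stop event
    cudaEventElapsedTime(&elapsed_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}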

To simplify the use of the CUDA event system a library is created that

wraps the CUDA event system functionality in a transparent manner. It pro-

vides the function cudaEvent(char *id, char *tag); which handles the event

creation and recording. It takes an id to identify the event inserted into

the stream and also allows tagging of events to group them together. The

function cudaEventFinish() finishes all timing operations by synchronizing on

the last events and then calculates the time difference between consecu-

tive events using cudaEventElapsedTime(). Additionally the time differences

for events with the same tags are added up, and an average is calculated.

cudaEventFinish() then returns a structure containing the calculated timing

information. The library can also handle multiple threads which is nec-

essary if multiple CUDA devices are used together because their contexts

should be spread across different threads. The library is able to distinguish

between threads and CUDA contexts and can log their timing information

separately.


3.6.2 Memory Tests

The main difference between the two mainboards is the width of their PCI

Express interconnects. The PCI Express bus of the server board has 8

interconnect lanes whereas the workstation board has 16. The transfer rate

of each bus lane is 250MB/s for PCI Express 1.0 [8]. Hence the theoretical

transfer rate of the workstation board is twice the rate of the server board.

During reconstruction a significant amount of data has to be transferred from
host to device and back: during forwardprojection the image is uploaded
and the sinogram is downloaded, during backward projection the sinogram
is uploaded and the image is downloaded. Depending on the number of
subsets and the iterations the amount of data transferred differs, but when

assuming a regular sized sinogram and 3 iterations with 21 subsets the data

transferred amounts to 2970MB. Therefore the transfer rate of the bus has

an influence on the reconstruction time.

Figure 3.16: Host to device PCI Express 8x and 16x, pinned and pageable memory (transfer rate in GB/s over data size in MB)

Figure 3.16 shows a comparison of the transfer rate of the Geforce 8800

Ultra (1663MHz) used in the two different mainboards and using pageable

or pinned memory. The first two graphs (PCIe 8x pageable and PCIe 8x

pinned) show the transfer rate of the card working in the Intel S5000PSL.

From the graphs one can conclude that it does not matter whether pageable

or pinned memory is used when using only an 8 lane wide PCI Express

bus. The transfer rate is hitting the maximum possible transfer rate of the


bus which effectively is about 1.5GB/s. Theoretically it should be 2GB/s

(8x250MB/s). The other two curves (PCIe 16x pageable and PCIe 16x

pinned) show the device operating in the Intel S5000XVN. It can be seen

that for pageable memory the maximum speed is about 2GB/s where the

maximum theoretical bandwidth is 4GB/s. In this case the memory copies

from the allocated memory to the dedicated area of pinned memory reserved

for CUDA host to device transfers that is performed by the CPU, limits

the transfer rate. The maximum transfer rate of 3GB/s is achieved when

using pinned memory and a 16 lanes wide bus. It again does not reach the

maximum theoretical rate of 4GB/s but is twice as fast as the PCIe 8x bus

and is thus consistent with the other measurements if it is assumed that the

device can reach 75% of the possible transfer rates.

Figure 3.17: Transfer rate for PCI Express 8x and 16x with pinned memory (transfer rate in GB/s over data size in MB for the Geforce 8800 Ultra (1512MHz) and the Tesla C870)

Figure 3.17 shows that this is true for the Geforce 8800 Ultra (1512MHz)

and the Tesla C870 cards as well. The timings are taken from pinned mem-

ory transfers. Devices connected via PCI Express 8x bus are able to transfer

data about half as fast as the devices using the 16 lane bus.

Figure 3.18 shows the maximum host to device memory transfer rates

for all available cards. The Tesla C870 is the fastest card overall when com-

paring host to device copies and the Geforce 8800 GTX performs worst.

The difference however is only marginal and a difference of 0.4GB/s trans-

fer rate does not significantly influence the runtime of the reconstruction


Figure 3.18: All cards maximum host to device performance (transfer rate in GB/s over data size in MB)

algorithms. Yet the difference between 8x and 16x PCI Express transfer

rates when using pinned memory is significant. The Intel S5000XVN board

allows CUDA devices to transfer data at almost twice the rate compared to

the Intel S5000PSL.

Figure 3.19 shows a comparison of the on-device memory transfers of all

available CUDA cards. It shows the Geforce 8800 Ultra (1663MHz) with its

memory clock at 1125MHz to be the fastest device with a maximum transfer

rate of 87GB/s. The second fastest card with 81GB/s is the same type of

card with lower clock frequency, the Geforce 8800 Ultra (1512MHz). The

Geforce 8800 GTX can copy memory on device at 71GB/s and the Tesla

C870 at 65GB/s. The cards' maximum transfer rates are directly proportional

to their respective memory clock frequencies.

As a result of the memory test it is recommended to equip the new

Reconstruction System with the Intel S5000XVN workstation board due to

its 16 lanes wide PCI Express bus which enables the system to transfer

data from the host to the device at twice the rate compared to the 8x bus.

The fastest CUDA device is the Geforce 8800 Ultra (1663MHz) which is the

recommended device after this test.


Figure 3.19: On device transfer rates (transfer rate in GB/s over data size in MB for all four cards)

3.6.3 Reconstruction Tests

The reconstruction tests show which system configuration is able to recon-

struct PET images the fastest. Table 3.5 shows the overall reconstruction

runtime in seconds sorted in ascending order for a 3D UW-OSEM 3 itera-

tions and 21 subset reconstruction calculating a 336x336x109 image from a

336x336x559 sinogram. Figure 3.20 shows the same information in a graph

for better comparison. The PCIe label designates the mainboard used for

the test. In the case of the PCIe 8x the Intel S5000PSL was used, in case of

PCIe 16x the test system contained the Intel S5000XVN mainboard. The

fastest system consists of the Geforce 8800 Ultra clocked at 1663MHz,

the PCIe 16x mainboard and an additional graphics card for display. Only 3

seconds slower is the same configuration with the Geforce card handling both

reconstruction and display. The slowest card is the Tesla C870 performing

worst when operating in either mainboard.

The margin between best performing and worst performing configuration

is, at almost 13 seconds, significant. The graph shows that there is no big

difference in performance between the Geforce Ultra cards when ignoring

the first measurement. The influence of an additional graphics device for

display as opposed to using the same graphics device for reconstruction and

display is hard to measure. The timings range from significant influence to

almost no influence at all. During the test it is noticed that moving the


No.  Configuration                                        Time in s
1    Geforce 8800 Ultra (1663MHz), PCIe 16x, No Display   55.19
2    Geforce 8800 Ultra (1663MHz), PCIe 16x, Display      58.35
3    Geforce 8800 Ultra (1512MHz), PCIe 16x, No Display   58.49
4    Geforce 8800 Ultra (1663MHz), PCIe 8x, Display       59.39
5    Geforce 8800 Ultra (1663MHz), PCIe 8x, No Display    59.70
6    Geforce 8800 Ultra (1512MHz), PCIe 16x, Display      59.82
7    Geforce 8800 Ultra (1512MHz), PCIe 8x, Display       62.66
8    Geforce 8800 Ultra (1512MHz), PCIe 8x, No Display    63.18
9    Tesla C870, PCIe 16x, No Display                     64.84
10   Tesla C870, PCIe 8x, No Display                      68.07

Table 3.5: Reconstruction time overview

mouse while the system reconstructs an image using the GPU and while the

same GPU is also used for display can influence the test result, i.e. slow down

the reconstruction. When wildly moving the mouse across the screen the

performance of the reconstruction system decreases dramatically. Therefore

the mouse was never moved during any of the tests. When moving the

mouse, the graphics device has to recalculate the screen image. The GPU

requires resources for doing so hence those resources are not available for

reconstruction. The GPU seems to be able to dynamically allocate resources

to either computation or display on the fly.

Figure 3.21 shows the overall performance of all system configurations

split up into the separate tasks of the reconstruction process. The "Backprojection of 1" represents the calculation of the backprojection of an initial sinogram filled with 1, which is done right after the reconstruction algorithm starts. This is the smallest task of the process, taking from 6 to 9 seconds. The "Backprojection" is the time the algorithm takes for all backprojection jobs during reconstruction, excluding the projection of the initial sinogram. The "Forwardprojection" constitutes all forwardprojection calculations during reconstruction. The times for "Other" contain all other tasks done during reconstruction such as data input, data output, calculation of the quotient and the new image for each subset, and initialization including parameter precalculation and the calculation of the circular mask.

Figure 3.22 shows how those four tasks make up the entire reconstruction

process. 78% of the reconstruction time is spent projecting images and

sinograms backward and forward and 22% of the time is spent doing other


[Figure: runtime in s for configurations 1 to 10]

Figure 3.20: Reconstruction time overview

[Figure: time in s per configuration 1 to 10, split into Backprojection of 1, Backprojection, Forwardprojection and Other]

Figure 3.21: Reconstruction time detailed overview


things. The figure also shows that the reconstruction time can be reduced

by about 12% if the backprojection of 1 is only calculated once and reused

for all reconstructions as it is always the same process and data.

[Figure: pie chart of the reconstruction time - Backprojection of 1: 12%, Backprojection: 34%, Forwardprojection: 32%, Other: 22%]

Figure 3.22: Reconstruction components

Figure 3.23 plots the runtime of the reconstruction against the maxi-

mum theoretical floating point operations per second of the hardware (bot-

tom x-axis) and the maximum theoretical data transfer rate (top x-axis).

The devices are from left to right the Tesla C870, the Geforce 8800 GTX,

the Geforce 8800 Ultra (1512MHz) and the Geforce 8800 Ultra (1663MHz).

The fastest reconstruction times for the hardware are plotted using the In-

tel S5000XVN workstation mainboard and an additional graphics device as

display adapter. Table 3.6 shows the plotted values.

No. Device                        Time in s  Gflop/s  GB/s
1   Tesla C870                      64.84      346    38.4
2   Geforce 8800 GTX                64.55      346    43.2
3   Geforce 8800 Ultra (1512MHz)    58.49      384    51.84
4   Geforce 8800 Ultra (1663MHz)    55.19      425    54

Table 3.6: Reconstruction time overview


[Figure: runtime in s plotted against rate of execution in Gflop/s (bottom axis) and transfer rate in GB/s (top axis)]

Figure 3.23: Reconstruction performance transfer and execution rate

It is interesting to compare the different transfer rates and execution

rates and the resulting reconstruction runtimes. The Tesla C870 and the

Geforce 8800 GTX have the same maximum theoretical rate of execution; however, the Geforce 8800 GTX is able to transfer data 5GB/s faster. The

influence of this on the reconstruction runtime is marginal. One can conclude

that the reconstruction runtime is not limited by the maximum transfer rate

of the CUDA device but by its execution rate. For this implementation the

execution rate seems to set a maximum for the performance of the algorithm.

The transfer rate specified here is calculated by multiplying the memory

bus width by the memory clock. Official numbers are twice that number

as they take the double data rate into account. The maximum execution

rate is calculated by multiplying the number of floating point ALUs by their clock speed, multiplied by two because a multiply-add instruction can be

executed within one cycle. Official numbers are higher than those specified

here (e.g. 520Gflop/s for the G80 chip) as those numbers take graphics-

specific operations into account.
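As a rough worked example, using Nvidia's publicly documented specifications for the Tesla C870 rather than values measured in this test: with 128 floating point ALUs at a 1350MHz shader clock the maximum execution rate is 128 x 1.35GHz x 2 = 345.6Gflop/s, and with a 384 bit (48 byte) wide memory bus at an 800MHz memory clock the transfer rate is 48 byte x 0.8GHz = 38.4GB/s, matching the values listed for this card in Table 3.6.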


3.7 Product Integration

After the successful development of a prototype implementation of the re-

construction algorithm using CUDA and a graphics card the algorithm is

integrated into the Siemens PET reconstruction system. The following sec-

tion describes the requirements and some key implementation details. The

most important requirements are:

• loose integration so that CUDA features can be turned on or off at

compile time and at runtime

• in case of an error the reconstruction system is to fail gracefully

• the reconstruction system has to produce the same results when using

CUDA devices or the regular CPU implementation

• flexible integration so that different and/or multiple CUDA devices

can be used for reconstruction including simultaneous use of multiple

cards

The loose integration requirement is met by using the preprocessor constant CUDA_GPU and by implementing the command line switch --gpu for the executable. The preprocessor constant encloses all GPU specific classes and function calls and allows the build toolchain to compile a version of the executable with or without GPU functionality included. The command line switch can be specified when starting the reconstruction executable to instruct the system to use the GPU accelerated reconstruction mode if possible.
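The following sketch illustrates how the compile-time constant and the runtime switch could interact. Apart from CUDA_GPU and --gpu, everything in it (the Projector stand-in type and the factory functions) is an assumption for illustration and not taken from the product code.

#include <cstring>

struct Projector { };                                    // stand-in type
static Projector* createCpuProjector() { return new Projector(); }
#ifdef CUDA_GPU
static Projector* createGpuProjector() { return new Projector(); }
#endif

// Combine the CUDA_GPU compile-time constant with the --gpu runtime
// switch to select the reconstruction path.
Projector* selectProjector(int argc, char** argv)
{
    bool gpuRequested = false;
    for (int i = 1; i < argc; ++i)
        if (std::strcmp(argv[i], "--gpu") == 0)
            gpuRequested = true;

#ifdef CUDA_GPU
    if (gpuRequested)
        return createGpuProjector(); // GPU support compiled in and requested
#endif
    (void)gpuRequested;              // unused in a CPU-only build
    return createCpuProjector();
}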

The graceful fail requirement is fulfilled by the extensive use of C++

exception handling. CUDA functions are only called from a low level C

CUDA wrapper. All calls to CUDA built-in functions return an error code of

type cudaError_t that can be checked for cudaSuccess. If the function does not

succeed the CUDA wrapper method that calls the built-in function returns

with an error. As there are a lot of calls to CUDA built-in functions, this is

simplified by using a macro that wraps all functions, checks for success and

returns if the call failed. Listing 3.13 shows that macro.

It is called like this: CUDA_SAVE_CALL(cudaSetDevice(GPUId)); and

all functions that contain it have to return an integer value. The

definition of a function that uses the macro might look like this:


#define CUDA_SAVE_CALL(call)                          \
    if (true) {                                       \
        cudaError_t my_cuda_error = call;             \
        if (my_cuda_error != cudaSuccess)             \
            return my_cuda_error;                     \
        else                                          \
            (void)0;                                  \
    } else (void)0

Listing 3.13: CUDA error check

extern "C" int CUDA_InitGPUPB(). All C++ methods from upper layers that call low-level CUDA wrapper functions have to check the return value of those functions. If the return value does not indicate successful execution of the called function, the C++ method throws an exception. This exception

bubbles up within the reconstruction system and reaches already existing

mechanisms to deal with exceptions including logging mechanisms.
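A minimal sketch of the two layers is shown below. It assumes that the CUDA_SAVE_CALL macro from Listing 3.13 is in scope; the parameter of CUDA_InitGPUPB, the GpuProjector class and its init method are hypothetical names used only for illustration, while cudaSetDevice, cudaError_t and cudaSuccess are part of the CUDA runtime API.

#include <cuda_runtime.h>  // cudaSetDevice, cudaError_t, cudaSuccess
#include <stdexcept>

// Low-level C wrapper: every CUDA call goes through CUDA_SAVE_CALL
// (Listing 3.13), so the function returns the first error it encounters.
extern "C" int CUDA_InitGPUPB(int GPUId)
{
    CUDA_SAVE_CALL(cudaSetDevice(GPUId));
    // ... further device setup such as memory allocation would follow ...
    return cudaSuccess;
}

// Upper C++ layer (hypothetical class): convert error codes into
// exceptions that bubble up to the existing handling and logging.
class GpuProjector {
public:
    void init(int gpuId)
    {
        if (CUDA_InitGPUPB(gpuId) != cudaSuccess)
            throw std::runtime_error("GPU projector initialization failed");
    }
};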

The third requirement, that the reconstruction system has to produce the same results when using GPU or CPU, is achieved by a very literal port of

the CPU algorithm to the GPU. Extensive testing and comparison of the

reconstructed images ensures that the requirement is met.

The fourth requirement defines how the GPU projectors should be inte-

grated into the reconstruction system. The CPU projectors start one thread

for each CPU core in the system. It is sensible to extend this principle to

GPU projectors. For every GPU core (there are graphics devices available

that contain 2 GPU cores) a single thread is used. This allows the simul-

taneous use of multiple cards and increases expandability. A second card

could simply be plugged into the system and the reconstruction system

could use two GPUs for reconstruction.

3.7.1 Integration Overview

The existing reconstruction system is a complex C++ application (>100,000

lines of code). The GPU extensions are integrated with care so that only

small portions of existing code need to be changed. Additionally software

engineering methods used in the existing product are applied to the GPU

extensions as well. In that the class model for the different projectors is

copied and changed to work with GPUs. The GPU projectors have to access

low level C functions to work with the GPU. Those functions are part of a


CUDA wrapper that exports its functions via extern "C" int functionname();

and makes them available for C++ classes to use. Figure 3.24 is the class dia-

gram for the projector system with GPU extensions included. The OSEM3D

class is the class that contains the OSEM reconstruction algorithm. It cre-

ates an object prj3D depending on what kind of projector is required for the

current reconstruction algorithm. An identical interface to all different pro-

jectors is achieved by using a base class Prj3D_Sheared that provides virtual

methods for the actual projectors to implement.

This class model is extended in the following way: the base class Prj3D_Sheared takes a parameter bool gpu in the constructor that indicates whether GPU projectors should be used or not. In the case of GPU projection the GPU projector initializes in exactly the same way as the CPU projector, which is important because GPU and CPU projectors require the same set of parameters. In addition to initialization the class creates an object of GPU_Prj3D_Sheared, which is organized the same way as Prj3D_Sheared. Only the base class for the projectors is moved to a separate class called GPU_Prj3D_Base. It is again a base class that provides virtual methods for

the actual projectors to implement.

If the prj3D object is asked to project data it checks if the GPU flag is

set. If it is set it tries to use its member prjGPU to project the data. If it is

not possible for the GPU to reconstruct the data due to memory limitations

the system falls back to CPU projectors. Based on reconstruction parame-

ters the system calculates how many resources the projector requires. Based

on that information the system searches for CUDA capable GPUs and de-

termines if there are GPUs available that meet the requirements. The CUDA

wrapper provides a function CUDA_GetGPUProperties(tGPUProperty* const Prop, int GPUId)

that queries all required information from the selected GPU such as global

memory size, shared memory size, maximum grid and block dimensions, the

major and minor device revision number and the clock rate.
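A sketch of what this wrapper could look like on top of the CUDA runtime call cudaGetDeviceProperties() is given below. The field names of tGPUProperty are assumptions derived from the list of properties above, and CUDA_SAVE_CALL is again the macro from Listing 3.13.

#include <cuda_runtime.h>

// Assumed layout of tGPUProperty; the actual product struct may differ.
typedef struct {
    size_t globalMem;         // global memory size in bytes
    size_t sharedMemPerBlock; // shared memory per block in bytes
    int    maxGridSize[3];    // maximum grid dimensions
    int    maxBlockSize[3];   // maximum block dimensions
    int    major, minor;      // device revision number
    int    clockRate;         // clock rate in kHz
} tGPUProperty;

// Query the CUDA runtime and copy the relevant fields; CUDA_SAVE_CALL
// (Listing 3.13) propagates any error code to the caller.
extern "C" int CUDA_GetGPUProperties(tGPUProperty* const Prop, int GPUId)
{
    cudaDeviceProp dev;
    CUDA_SAVE_CALL(cudaGetDeviceProperties(&dev, GPUId));
    Prop->globalMem         = dev.totalGlobalMem;
    Prop->sharedMemPerBlock = dev.sharedMemPerBlock;
    for (int i = 0; i < 3; ++i) {
        Prop->maxGridSize[i]  = dev.maxGridSize[i];
        Prop->maxBlockSize[i] = dev.maxThreadsDim[i];
    }
    Prop->major     = dev.major;
    Prop->minor     = dev.minor;
    Prop->clockRate = dev.clockRate;
    return cudaSuccess;
}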

3.7.2 Multithreading Implementation

A multithreaded approach is necessary to be able to utilize multiple GPUs

at the same time and it makes it easier to run other algorithms or even CPU

projectors in parallel to the GPU projectors. The latter is possible because

the CUDA framework is designed so that threads yield if they are blocking

because they have to wait for GPU operations to finish. According to CUDA


Figure 3.24: Class diagram


documentation, threads will block after 16 consecutive kernel calls because

of a full queue that stores kernel calls or after a memory operation is issued.

The threading concept implements a controller-worker principle. A con-

troller thread dispatches worker threads that perform the calculations. For

each GPU one worker thread is created. This is because each thread is as-

signed a different GPU context, a CUDA API concept that allows a process

or thread to specify which card it works with. The controller thread sig-

nals the worker thread to start working and the worker thread signals the

controller thread as soon as it is finished.

Figure 3.25: Multithreading

Figure 3.25 shows this principle. The main thread or controller thread

creates a worker thread and can then execute other operations. The worker

thread will initialize and wait (blocking) for a signal from the main thread

to start projection. If the main thread decides to start the projection it will

first assign the projection parameters to the thread and then signal it to

start. The main thread can then do other things such as scheduling other

projections using another GPU or even CPU. The main thread would do


that by signaling yet another thread. As soon as the main thread has started all projections and finished its operations, it will wait for the worker thread by blocking on a signal. As soon as the worker thread has finished the projection

it will signal the main thread to indicate that the projected data is ready

to be consumed. The worker thread will then block again for the signal

from the main thread to start yet another projection. For the sake of clarity the

figure does not contain the necessary synchronization point after the worker

thread is correctly initialized. This is necessary because otherwise the main

thread would try to access variables of the worker that are not initialized

yet. There is also no exit condition shown in the figure. It is implemented

so that the main thread is able to set a flag that signals the worker thread

to exit. The main thread is able to start any number of worker threads.
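A simplified sketch of such a worker thread is given below. The WorkerArgs structure and all names are assumptions made for illustration; the actual signaling, indicated only by comments here, is discussed in the following paragraphs.

#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical per-worker state; the real product structures differ.
struct WorkerArgs {
    int  gpuId;          // the GPU this worker is bound to
    bool exitRequested;  // set by the controller to shut the worker down
    // ... projection parameters, buffers and signaling objects ...
};

// Worker thread: bind the thread to one GPU context, then loop waiting
// for work from the controller until it is asked to exit. In a real
// implementation access to exitRequested is protected by the signaling.
void* workerThread(void* p)
{
    WorkerArgs* args = static_cast<WorkerArgs*>(p);
    cudaSetDevice(args->gpuId);   // establish this thread's GPU context
    // ... allocate GPU memory, signal the controller that init is done ...
    while (!args->exitRequested) {
        // ... wait (blocking) for the controller's start signal ...
        // ... run the forward or backprojection on the GPU ...
        // ... signal the controller that the projected data is ready ...
    }
    return 0;
}

The controller would start one such thread per GPU with pthread_create() and join it at shutdown with pthread_join().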

Currently the thread and signaling mechanisms are implemented using

POSIX Threads, a POSIX standard for thread implementations. A thread

is started using pthread_create() and closed using pthread_join(). The signaling and synchronization is currently implemented with two semaphores that are locked and released by either thread using pthread_mutex_lock() and pthread_mutex_unlock(). This implementation proved to be difficult to main-

tain so an alternative is suggested that better implements the signaling

metaphor.

A signaling class signal is suggested that contains only one mutex, a POSIX Threads condition variable, a boolean signal variable and an unsigned integer waiting variable, and that provides two methods, void wait() and void send().

Both methods lock a mutex upon entering to guarantee mutual ex-

clusion. This is necessary because both methods access and modify the

same variables and are called from different threads. Additionally the

pthread_cond_wait() and pthread_cond_signal() functions require the mutex to

be locked. For the wait function there are two cases to consider after it has locked the mutex. In the first case the sender has already signaled the receiver.

If so the receiver can go ahead without interruption. If however there is no

signal, the waiting thread indicates that one more thread is waiting for the

signal by incrementing the waiting member. After that pthread_cond_wait() is

called which handles two things: first it transfers the thread into a waiting

state until the signal condition variable is set. Second, the mutex is unlocked

as long as the thread is blocked. The send method considers two states as

well. If no thread is waiting, the signal variable is simply set to indicate the


void signal::wait(){
    pthread_mutex_lock(&mutex);
    if (signal)               // signal already received
        signal = false;
    else {                    // wait for signal
        waiting++;            // count waiting threads
        // release the mutex and block the thread
        pthread_cond_wait(&cond, &mutex);
        // here the mutex is locked again
        waiting--;
    }
    pthread_mutex_unlock(&mutex);
}

void signal::send(){
    pthread_mutex_lock(&mutex);
    if (waiting < 1)          // no threads waiting
        signal = true;
    else                      // signal waiting threads
        pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

Listing 3.14: Signaling methods

signal. If a thread is waiting and thus blocking on the condition variable,

the method sends the signal.
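A sketch of how a controller and a worker could use this class, assuming one signal pair per worker thread and a class declaration matching the description above, is given below; the function names and the global objects are illustrative assumptions.

// One signal pair per worker thread; the declaration of the signal
// class (members as described above, methods from Listing 3.14) is
// assumed to be available.
extern signal startProjection;   // controller -> worker: parameters ready
extern signal projectionDone;    // worker -> controller: result ready

void workerProjectOnce()            // runs inside the worker thread
{
    startProjection.wait();         // block until the controller says go
    // ... run the forward or backprojection on this worker's GPU ...
    projectionDone.send();          // hand the projected data back
}

void controllerScheduleProjection() // runs inside the controller thread
{
    // ... assign the projection parameters to the worker ...
    startProjection.send();         // let the worker start
    // ... schedule further projections on other GPUs or on the CPU ...
    projectionDone.wait();          // block until the result is available
}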

The sequence diagram in figure 3.26 shows initialization, projection and

shutdown of the projector code for parallel beam (PB) projectors. After ini-

tialization of the projector object the number of available GPUs is detected and their properties are read out. The method TestGPUCapacityPB() checks

whether or not one or more GPUs fit the requirements for the selected re-

construction mode. If one or more GPUs are found a thread is started, the

thread is assigned the correct GPU context and the GPU is prepared for

reconstruction: memory is allocated on the GPU and for transfer on the

host system, grid and block dimensions are calculated, constant memory is

prepared. After the GPU is set up successfully the projectors are ready and

the OSEM algorithm can use them. The diagram shows a call to backproject

a sinogram. The sequence ends with a call to the destructor of the projector

classes.


Figure 3.26: Sequence Diagram


3.7.3 Parallel CPU and GPU projection

The GPU projectors are integrated into the product so that they can eas-

ily be modified to support parallel execution of CPU and GPU projection

threads. A scheduling algorithm could be implemented that spreads the

work across all system resources (CPUs and GPUs). At the time of im-

plementation it is difficult to speed up the reconstruction using both CPUs

and GPUs because using one GPU degraded system performance to a de-

gree. Basically one thread is always busy waiting when the GPU is running.

In newer versions of the CUDA framework the thread yields, thus allowing

other threads to occupy the CPU. Additionally, the scheduling granularity of one view of the CPU projectors worsens the efficiency of CPU and GPU

projectors operating in parallel. Taking as a basis a ratio of 1/2 for the reconstruction times between one GPU and all CPU cores, which means that the CPU cores require twice as much time to project a given data set as the GPU, a parallel execution could take 1/3 less time. Taking into account the original speed gain of GPU over CPU of 1/2, the parallel execution only accounts for 1/6 less time. This example assumes no

additional overhead for a parallel implementation. The speedup for parallel
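These figures follow from a short back-of-the-envelope calculation that assumes perfect load balancing: if the GPU projects a data set in time T, the CPU cores need 2T. Splitting the work in proportion to speed lets the GPU handle two thirds and the CPU one third of the data, so both finish after 2T/3, i.e. 1/3 less time than the GPU alone. Relative to the original CPU-only time of 2T this additional saving of T/3 corresponds to 1/6, on top of the 1/2 already gained by moving the projections from the CPU to the GPU.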

utilization of GPU and CPU projectors is highest if CPU and GPU take

roughly the same amount of time to project. Therefore projectors currently

don’t use CPU and GPU at the same time but use exclusively either one or

the other. Instead of focusing on the challenge of speeding up reconstruc-

tion with projectors running on both computing resources at the same time

more effort is put into trying to recreate CPU results using the GPU for all

different projector methods including LOR and span-1 projectors for PSF

reconstruction.

3.7.4 Hybrid Implementation

The reconstruction system can be called a hybrid system employing both

CPU and GPU computational resources because both system components

are used during reconstruction and for different purposes. The GPU is used

to calculate all projections that are required for reconstruction and some

minor tasks that can be efficiently implemented and optimized for GPU

use. An example of this is the index reshuffling of images before and after

projection. All other tasks such as the actual OSEM calculation, optimiza-


tion and correction methods such as normalization and scatter correction

are executed on the CPU. Also some parameters that are required by the

projectors are calculated by the CPU beforehand. Important parameters

that affect the numerical stability of the projectors are even calculated in

double precision.


Chapter 4

Conclusion

In this thesis an algorithm for reconstructing images from positron emission

tomography sinograms was ported from the CPU to a modern graphics

processing unit using the CUDA framework. The algorithm was extended

to fit the GPU programming model and optimized for speed to utilize the

GPU to its full potential. The results produced from the GPU implemen-

tation are numerically identical to those calculated by the CPU. The final

and intermediate results can thus be interchanged, which allows a seamless

integration of the GPU routines into existing reconstruction systems as well

as a hybrid implementation that utilizes both CPU and GPU.

The results of this work not only help to speed up the reconstruction

process and thereby potentially enhance the clinical workflow and improve

the patient throughput. They also show that data parallel problems can

be solved efficiently using modern GPU technology. With the introduction

of new frameworks those powerful platforms have been made available not

only to graphics experts but also to scientists and developers of numerical

algorithms. Frameworks such as CUDA and BrookGPU allow for the quick

development of algorithms on GPUs.

As GPUs are a mass market product they are a comparably cheap al-

ternative to high-end multicore multi-CPU workstations. Top-of-the-line

GPUs usually sell for about $500 to $700. This allows clinics to upgrade

regular workstations to reconstruction systems simply by installing a rela-

tively cheap graphics device in the system. This would allow radiologists

to reconstruct data on their workstations with different parameters and al-

gorithms somewhat detached from the regular clinical workflow. This gives


radiologists the chance to reevaluate interesting scans and run special pur-

pose reconstructions. With physicians having more tools at their hands this

might ultimately result in a more accurate diagnosis for the patient.

The next steps to take are to try to create a hybrid system that extends

the functional decomposition of the problem domain as it is implemented

now to a true domain space decomposition utilizing the computing power

of both CPU and GPU for projection. This was not sensible in the current

system as the CPU was too slow to contribute any major speedup to the

process but is definitely an area where the current system could be improved.

Another option is to use multiple GPUs in one single system and parallelize the

problem space across those devices. This would introduce another layer of

parallelization: all GPUs could work on one reconstruction problem by, for example, distributing projections across views, or it would also be possible to

run multiple reconstructions at the same time.

Additionally other time consuming algorithms that are required for PET

reconstruction could be ported to the GPU. The filtering routines that are

applied to the sinograms during PSF enabled PET reconstruction for ex-

ample could be efficiently implemented using a CUDA device. It is even

possible to implement the entire OSEM algorithm on the GPU.

Also, other projectors that require even more processing power than those currently implemented could be ported to GPUs, such as the time

of flight projectors. In time of flight data acquisition the location of the

positron emission is measured more accurately by considering the time dif-

ference it took the two gamma rays to travel towards the detectors. This

effectively results in one more dimension in the sinogram datasets and thus

increases the size of the input data and requires a more complex projection

algorithm which could be effectively implemented on GPUs.

Other possibilities for future developments would be a move away from

CUDA towards OpenCL. Use of OpenCL would allow the execution of al-

gorithms on both CPU and GPU platforms. Apart from this advantage it

is also an open industry standard which ensures code reuse across hardware

vendors. This would allow PET system manufacturers to choose from all

vendors and not restrict themselves to e.g. only Nvidia.

The existing code base could also be refactored to use LOR projectors for

both PB and LOR reconstruction. This is possible because PB projectors

are only a subset of LOR projectors. Doing so would greatly simplify the


maintainability of the code.

The introduction of modern GPUs into PET products used in the regular

clinical workflow has given developers a powerful new tool that is easy to

use thanks to frameworks such as CUDA and OpenCL. It enables them to

fulfill the demands made by physicians and clinics for high quality images

and short reconstruction times.


List of Abbreviations

CPU Central Processing Unit

CT Computed Tomography

CTM Close to Metal

CUDA Compute Unified Device Architecture

FDG Fluorodeoxyglucose

FLOP Floating Point Operation

FOV Field of View

GPGPU General-Purpose computation on GPUs

GPU Graphics Processing Unit

HPC High Performance Computing

LES Linear Equation System

LOR Line of Response

ML-EM Maximum-Likelihood Expectation Maximization

mrd maximum ring difference

MSE Mean Squared Error

NMSE Normalized Mean Squared Error

OSEM Ordered Subset Expectation Maximization

PB Parallel Beam

PET Positron Emission Tomography


PTX Parallel Thread Execution

SIMD Single Instruction Multiple Data

SPECT Single Photon Emission Computed Tomography

SSE Streaming SIMD Extensions

ULP Unit of Least Precision


List of Figures

2.1 Decay and Annihilation exemplified by 18F . . . . . . . . . . . 19

2.2 Functional diagram of cyclotron . . . . . . . . . . . . . . . . . 21

2.3 Chemical structure of Glucose compared to Fluorodeoxyglucose 22

2.4 Schematic layout of scintillation detector block . . . . . . . . 24

2.5 True and Scattered positron events . . . . . . . . . . . . . . . 24

2.6 Random and Multiple coincidences . . . . . . . . . . . . . . . 25

2.7 Uncertain and Attenuated positron events . . . . . . . . . . . 25

2.8 Difference between 2D and 3D data acquisition mode . . . . . 27

2.9 Efficient data structuring . . . . . . . . . . . . . . . . . . . . 28

2.10 The PET coordinate system . . . . . . . . . . . . . . . . . . . 29

2.11 Oblique LORs are not parallel . . . . . . . . . . . . . . . . . . 29

2.12 Comparison of LOR and PB space projection . . . . . . . . . 30

2.13 The span concept for spans 3, 5, 7 and 9 . . . . . . . . . . . . 31

2.14 Segments 1, 0 and -1 for span 9 and mrd 13 . . . . . . . . . . 32

2.15 Michelogram of a 55 ring, 38mrd, span 11 3D PET system . . 33

2.16 Comparison of analytic and iterative reconstruction algorithms 35

2.17 Projection of positron emissions at θ = 90◦ into sinogram row 37

2.18 Basic ML-EM algorithm . . . . . . . . . . . . . . . . . . . . . 40

2.19 Basic OSEM algorithm . . . . . . . . . . . . . . . . . . . . . . 41

3.1 Projector using sheared image . . . . . . . . . . . . . . . . . . 46

3.2 CPU projector efficiency analysis . . . . . . . . . . . . . . . . 48

3.3 Back-projector structogram . . . . . . . . . . . . . . . . . . . 50

3.4 Comparison of original and modified implementation . . . . . 51

3.5 Geforce 8800 architecture . . . . . . . . . . . . . . . . . . . . 53

3.6 CUDA programming model . . . . . . . . . . . . . . . . . . . 54

3.7 CUDA implementation of the forward projector . . . . . . . . 63

3.8 Unshearing of image during backprojection . . . . . . . . . . 64


3.9 Runtime comparison of different setups . . . . . . . . . . . . 72

3.10 Absolute error comparison of different setups . . . . . . . . . 73

3.11 Unpadded and padded memory . . . . . . . . . . . . . . . . . 73

3.12 Index reshuffling on CUDA device . . . . . . . . . . . . . . . 75

3.13 Hexeditor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.14 Sinogram Viewer Tool . . . . . . . . . . . . . . . . . . . . . . 78

3.15 Testimages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

3.16 Host to device PCI Express 8x and 16x pinned and pageable 85

3.17 Transfer rate for PCI Express 8x and 16x (pinned) . . . . . . 86

3.18 All cards maximum host to device performance . . . . . . . . 87

3.19 On device transfer rates . . . . . . . . . . . . . . . . . . . . . 88

3.20 Reconstruction time overview . . . . . . . . . . . . . . . . . . 90

3.21 Reconstruction time detailed overview . . . . . . . . . . . . . 90

3.22 Reconstruction components . . . . . . . . . . . . . . . . . . . 91

3.23 Reconstruction performance transfer and execution rate . . . 92

3.24 Class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

3.25 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . 97

3.26 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . 100


List of Tables

2.1 Properties of positron-emitting atoms (reproduced from [5]) . 18

2.2 Properties of commonly used scintillators (reproduced from [5]) 20

3.1 CUDA Memory Overview . . . . . . . . . . . . . . . . . . . . 53

3.2 Selected Nvidia cards memory specification . . . . . . . . . . 61

3.3 Projector kernel functions . . . . . . . . . . . . . . . . . . . . 65

3.4 Setup description and speedup . . . . . . . . . . . . . . . . . 72

3.5 Reconstruction time overview . . . . . . . . . . . . . . . . . . 89

3.6 Reconstruction time overview . . . . . . . . . . . . . . . . . . 91


Listings

3.1 SSE memory allocation . . . . . . . . . . . . . . . . . . . . . 47

3.2 Multiply volume CPU code . . . . . . . . . . . . . . . . . . . 55

3.3 Multiply volume GPU code . . . . . . . . . . . . . . . . . . . 56

3.4 CUDA memory operations . . . . . . . . . . . . . . . . . . . . 66

3.5 Register and shared memory implementation of interpolation 67

3.6 Usage of constant memory to reduce register usage . . . . . . 68

3.7 Populating CUDA shared memory . . . . . . . . . . . . . . . 70

3.8 Wrapper function for pinned memory allocation . . . . . . . . 71

3.9 Kernel to pad and unpad 3 dimensional arrays on the GPU . 74

3.10 Testscript excerpt . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.11 rawdiff tool output . . . . . . . . . . . . . . . . . . . . . . . . 81

3.12 CUDA timing events . . . . . . . . . . . . . . . . . . . . . . . 84

3.13 CUDA error check . . . . . . . . . . . . . . . . . . . . . . . . 94

3.14 Signaling methods . . . . . . . . . . . . . . . . . . . . . . . . 99


Bibliography

[1] Comparative evaluation of visualization and experimental results using

image comparison metrics, Washington, DC, USA, 2002. IEEE Com-

puter Society.

[2] ATI. CTM Guide - Technical Reference Manual. Web

site: http://ati.amd.com/companyinfo/researcher/documents/

ATI_CTM_Guide.pdf, Last accessed: 05/06/2008.

[3] MS Atkins, D. Murray, and R. Harrop. Use of transputers in a 3-D

Positron Emission Tomograph. IEEE transactions on medical imaging,

10(3):276–283, 1991.

[4] B. Bai and AM Smith. Fast 3D iterative reconstruction of PET images

using PC graphics hardware. In IEEE Nuclear Science Symposium

Conference Record, 2006, volume 5, 2006.

[5] Dale L. Bailey, David W. Townsend, Peter E. Valk, and Michael N.

Maisey, editors. Positron Emission Tomography: Basic Sciences.

Springer, 1 edition, 4 2005.

[6] N. Bohr. On the constitution of atoms and molecules, Part 1, Binding of

electrons by positive nuclei. Philosophical Magazine, 26(1):1–24, 1913.

[7] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston,

and P. Hanrahan. Brook for GPUs: stream computing on graphics

hardware. ACM Transactions on Graphics, 23(3):777–786, 2004.

[8] R. Budruk. PCI express system architecture. Addison-Wesley Profes-

sional, 2003.


[9] B. Cabral, N. Cam, and J. Foran. Accelerated volume rendering and

tomographic reconstruction using texture mapping hardware. In Pro-

ceedings of the 1994 symposium on Volume visualization, pages 91–98.

ACM New York, NY, USA, 1994.

[10] ME Casey and R. Nutt. A multicrystal two dimensional BGO detec-

tor system for positron emission tomography. Nuclear Science, IEEE

Transactions on, 33(1):460–463, 1986.

[11] CM Chen, S.Y. Lee, and ZH Cho. Parallelization of the EM algorithm

for 3-D PET image reconstruction. IEEE Transactions on Medical Imag-

ing, 10(4):513–522, 1991.

[12] Arthur H. Compton. A Quantum Theory of the Scattering of X-rays

by Light Elements. Physical Review, 21(5):483–502, May 1923.

[13] AP Dempster, NM Laird, and DB Rubin. Maximum Likelihood from

Incomplete Data via the EM Algorithm. Journal of the Royal Statistical

Society. Series B (Methodological), 39(1):1–38, 1977.

[14] K.L. Giboni, E. Aprile, T. Doke, M. Hirasawa, and M. Yamamoto. Co-

incidence timing of Schottky CdTe detectors for tomographic imaging.

Nuclear Inst. and Methods in Physics Research, A, 450(2-3):307–312,

2000.

[15] Z. He, W. Li, GF Knoll, DK Wehe, J. Berry, and CM Stahle. 3-D

position sensitive CdZnTe gamma-ray spectrometers. Nuclear Instru-

ments and Methods in Physics Research-Section A Only, 422(1):173–

178, 1999.

[16] K. Herholz, E. Salmon, D. Perani, JC Baron, V. Holthoff, L. Frölich, P. Schönknecht, K. Ito, R. Mielke, E. Kalbe, et al. Discrimination between

Alzheimer dementia and controls by automated analysis of multicenter

FDG PET. Neuroimage, 17(1):302–316, 2002.

[17] IK Hong, ST Chung, HK Kim, YB Kim, YD Son, and ZH Cho. Ul-

tra fast symmetry and SIMD-based projection-backprojection (SSP)

algorithm for 3-D PET image reconstruction. IEEE Transactions on

Medical Imaging, 26(6):789–803, 2007.


[18] HM Hudson and RS Larkin. Accelerated image reconstruction using

ordered subsets of projection data. Medical Imaging, IEEE Transactions

on, 13(4):601–609, 1994.

[19] Wen-Mei Hwu and David Kirk. Programming Massively Parallel Pro-

cessors. Web site: http://courses.ece.uiuc.edu/ece498/al1/, Last

accessed: 05/06/2008.

[20] Intel. Intel C++ Compiler for Linux Intrinsics Reference. Intel.

[21] Intel. Server Board S5000PSL, Product Brief. Intel.

[22] Intel. Workstation Board S5000XVN, Product Brief. Intel.

[23] CA Johnson, Y. Yan, RE Carson, RL Martino, and ME Daube-

Witherspoon. A system for the 3D reconstruction of retracted-septa

PET data using the EM algorithm. IEEE Transactions on Nuclear Sci-

ence, 42(4 Part 1):1223–1227, 1995.

[24] JP Jones, WF Jones, F. Kehren, DF Newport, JH Reed, MW Lenox,

LG Byars, K. Baker, C. Michel, ME Casey, et al. SPMD cluster-based

parallel 3D OSEM. In 2002 IEEE Nuclear Science Symposium Confer-

ence Record, volume 3, 2002.

[25] W.F. Jones, M.E. Casey, and L.G. Byars. Design of super-fast three-

dimensional projection system for Positron Emission Tomography,

June 29 1993. US Patent 5,224,037.

[26] P.M. Joseph. Improved algorithm for reprojecting rays through pixel

images. Medical Imaging, IEEE Transactions on, 1(3):192–196, 1982.

[27] DJ Kadrmas. Rotate-and-Slant Projector for Fast LOR-Based Fully-3-

D Iterative PET Reconstruction. IEEE Transactions on Medical Imag-

ing, 27(8):1071–1083, 2008.

[28] F. Kehren. Vollständige iterative Rekonstruktion von dreidimension-

alen Positronen-Emissions-Tomogrammen unter Einsatz einer spe-

icherresidenten Systemmatrix auf Single- und Multiprozessor-Systemen.

PhD thesis, Rheinisch-Westfälischen Technischen Hochschule (RWTH)

Aachen, 2001.


[29] T. Kimble, M. Chou, and BHT Chai. Scintillation properties of LYSO

crystals. Nuclear Science Symposium Conference Record, 3:1434–1437,

2002.

[30] M.A. Lodge, R.D. Badawi, R. Gilbert, P.E. Dibos, and B.R. Line. Com-

parison of 2-Dimensional and 3-Dimensional Acquisition for 18F-FDG

PET Oncology Studies Performed on an LSO-Based Scanner. Journal

of Nuclear Medicine, 47(1):23–31, 2006.

[31] DB Loveman. High Performance Fortran. IEEE [see also IEEE Con-

currency] Parallel & Distributed Technology: Systems & Applications,

1(1):25–42, 1993.

[32] Christian Maas. Freeware Hex Editor XVI32. Web site: http://www.

chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm, Last ac-

cessed: 02/24/2008.

[33] S.A. Mahlke, R.E. Hank, J.E. McCormick, D.I. August, and W.W.

Hwu. A Comparison of Full and Partial Predicated Execution Support

for ILP Processors. International Symposium on Computer Architec-

ture, 23:138–150, 1995.

[34] A. Munshi. OpenCL Specification Version 1.0, 12 2008.

[35] Nvidia. CUDA Matrix Transpose SDK Example. Web

site: http://developer.download.nvidia.com/compute/cuda/sdk/

website/samples.html, Last accessed: 09/13/2007.

[36] Nvidia. GeForce 8 Series Overview. Web site: http://www.nvidia.

com/page/geforce8.html, Last accessed: 3/4/2009.

[37] Nvidia. GeForce Family Overview. Web site: http://www.nvidia.

com/object/geforce_family.html, Last accessed: 3/4/2009.

[38] Nvidia. Technical Brief - NVIDIA GeForce 8800 GPU Archi-

tecture Overview. Web site: http://www.nvidia.com/content/

PDF/Geforce_8800/GeForce_8800_GPU_Architecture_Technical_

Brief.pdf, Last accessed: 05/02/2008, 2007.

[39] Nvidia. NVIDIA Compute PTX: Parallel Thread Execution Version

1.1. Nvidia, 2008.


[40] Nvidia. NVIDIA CUDA Compute Unified Device Architecture Program-

ming Guide Version 1.1. Nvidia, 2008.

[41] Nvidia. NVIDIA CUDA Compute Unified Device Architecture Program-

ming Guide Version 2.1. Nvidia, 2009.

[42] AW Paeth. A fast algorithm for general raster rotation. In Proceed-

ings on Graphics Interface’86/Vision Interface’86 table of contents,

pages 77–81. Canadian Information Processing Society Toronto, Ont.,

Canada, Canada, 1986.

[43] Vladimir Panin and Frank Kehren. Acceleration of Joseph's method for full 3D reconstruction of nuclear medical images from projection data,

2008.

[44] VY Panin, F. Kehren, H. Rothfuss, D. Hu, C. Michel, and ME Casey.

PET reconstruction with system matrix derived from point source mea-

surements. Nuclear Science, IEEE Transactions on, 53(1 Part 1):152–

159, 2006.

[45] Ervin B. Podgorsak. Radiation Physics for Medical Physicists (Biolog-

ical and Medical Physics, Biomedical Engineering). Springer, 1 edition,

10 2005.

[46] E. Rutherford. The Scattering of α and β Particles by Matter and the

Structure of the Atom. Philosophical Magazine, 21:669, 1911.

[47] H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-

based CT reconstruction using the common unified device architecture

(CUDA). In IEEE Nuclear Science Symposium Conference Record,

2007. NSS’07, volume 6, 2007.

[48] LA Shepp and Y. Vardi. Maximum likelihood reconstruction for emis-

sion tomography. Medical Imaging, IEEE Transactions on, 1(2):113–

122, 1982.

[49] M. Teras, T. Tolvanen, J.J. Johansson, J.J. Williams, and J. Knuuti.

Performance of the new generation of whole-body PET/CT scanners:

Discovery STE and Discovery VCT. European Journal of Nuclear

Medicine and Molecular Imaging, 34(10):1683–1692, 2007.


[50] P.A. Tipler. Physics for Scientists and Engineers. Worth Publishers

New York, NY, 1991.

[51] Wladimir J. van der Laan. Cubin Utilities. Web site: http://www.cs.

rug.nl/~wladimir/decuda/, Last accessed: 03/26/2009.

[52] S. Vollmar, C. Michel, JT Treffert, DF Newport, M. Casey, C. Knoss,

K. Wienhard, X. Liu, M. Defrise, and W.D. Heiss. HeinzelCluster:

accelerated reconstruction for FORE and OSEM3D. In 2001 IEEE

Nuclear Science Symposium Conference Record, volume 3, 2001.

[53] Stefan Vollmar. VINCI: Volume Imaging in Neurological Research, Co-

Registration and ROIs included. Web site: http://www.nf.mpg.de/

vinci3/, Last accessed: 4/2/2008.

[54] Z. Wang, G. Han, T. Li, and Z. Liang. Speedup OS-EM image re-

construction by PC graphics card technologies for quantitative SPECT

with varying focal-length fan-beam collimation. IEEE transactions on

nuclear science, 52(5 Part 1):1274–1280, 2005.

[55] Wikipedia. Nvidia Tesla. Web site: http://en.wikipedia.org/wiki/

NVIDIA_Tesla, Last accessed: 3/4/2009.
