FPGA Accelerated 3-D Tomography

Richard Dorrance

Progress Update: 09/07/12

Outline

Introduction to Tomography
Reconstruction Methods
– Analytical
  o Backprojection
  o Filtered Backprojection
– Algebraic
  o Algebraic Reconstruction Technique (ART)
  o Simultaneous Iterative Reconstruction Technique (SIRT)
  o Simultaneous Algebraic Reconstruction Technique (SART)
Modeling Performance of Reconstruction Methods
Future Work

Tomography

Cross-sectional imaging technique using transmission or reflection data from multiple angles
Basis for CAT scan, MRI, PET, SPECT, ET, etc.
Computed Tomography (CT): a form of tomographic reconstruction on computers

Cross-Sections by X-Ray Projections

Project X-ray through biological tissue; measure total absorption of ray by tissue

Projection Pθ(t) is the Radon transform of object function f(x,y):

$$P_\theta(t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy$$

Total set of projections is called the sinogram

Shepp-Logan Phantom

Standard test image for tomographic reconstructions

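Example Image with Projections

Image (3×3):

1 3 1
1 2 1
4 1 1

Projections:
Rows (horizontal rays, top to bottom): 5, 4, 6
Columns (vertical rays, left to right): 6, 6, 3
Diagonals (bottom-left corner to top-right corner): 4, 2, 4, 4, 1
Anti-diagonals (top-left corner to bottom-right corner): 1, 4, 7, 2, 1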
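The projections of the 3×3 example can be reproduced with a few lines of C. This is only a minimal sketch: the function name `project`, the array layout, and the ray indexing are illustrative assumptions, and every pixel is assumed to contribute with weight 1 to exactly one ray per direction.

```c
#define N 3                /* image is N x N */
#define NDIAG (2 * N - 1)  /* number of rays along each diagonal direction */

/* Sum pixel values along four ray directions (hypothetical helper).
 *   rows[r] : horizontal ray through row r
 *   cols[c] : vertical ray through column c
 *   diag[d] : diagonal ray with c - r = d - (N - 1)
 *   anti[s] : anti-diagonal ray with r + c = s */
void project(double img[N][N], double rows[N], double cols[N],
             double diag[NDIAG], double anti[NDIAG])
{
    for (int k = 0; k < N; k++)
        rows[k] = cols[k] = 0.0;
    for (int k = 0; k < NDIAG; k++)
        diag[k] = anti[k] = 0.0;

    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            rows[r] += img[r][c];
            cols[c] += img[r][c];
            diag[c - r + (N - 1)] += img[r][c];
            anti[r + c] += img[r][c];
        }
    }
}
```

Called on the image 1 3 1 / 1 2 1 / 4 1 1, this yields row sums 5, 4, 6, column sums 6, 6, 3, diagonal sums 4, 2, 4, 4, 1, and anti-diagonal sums 1, 4, 7, 2, 1.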

CT Reconstruction

Restore image from projection data
Inverse Radon transform
Most common algorithm is filtered backprojection
– “Smear” each projection over image plane
Accuracy of reconstruction depends on the number of detectors and projection angles

[Figure: original image and reconstructions from 4, 16, 64, and 256 angles]

Analytical Reconstruction Methods

(Filtered) Backprojection Pseudo Code:
– Input: sinogram sino(θ, N)
– Output: image img(x,y)

for each θ
    filter sino(θ,:)                        ; only for FBP
    for each x
        for each y
            n = x*cos(θ) + y*sin(θ)         ; detector position hit by pixel (x,y), rounded or interpolated in practice
            img(x,y) = img(x,y) + sino(θ,n)

Backprojection (Step 1)

Start from an all-zero image:

0 0 0
0 0 0
0 0 0

Backprojection (Step 2)

After smearing the row projections (5, 4, 6):

5 5 5
4 4 4
6 6 6

Backprojection (Step 3)

After adding the diagonal projections (4, 2, 4, 4, 1):

9 9 6
6 8 8
10 8 10

Backprojection (Step 4)

After adding the column projections (6, 6, 3):

15 15 9
12 14 11
16 14 13

Backprojection (Step 5)

After adding the anti-diagonal projections (1, 4, 7, 2, 1):

16 19 16
16 21 13
23 16 14

Backprojection vs. Original

Final step: normalize image power
– Divide each pixel by θ·N (number of angles × N; here 4·3 = 12)

Backprojection:

1.33 1.58 1.33
1.33 1.75 1.08
1.92 1.33 1.17

Original:

1 3 1
1 2 1
4 1 1
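The unfiltered backprojection of the 3×3 example, including the final normalization, can be sketched in C as follows. The function name `backproject`, the constant `NANGLES`, and the ray indexing (one row, column, diagonal, and anti-diagonal ray through every pixel, each with weight 1) are assumptions made for illustration, not part of the original slides.

```c
#define N 3                /* image is N x N */
#define NDIAG (2 * N - 1)  /* rays along each diagonal direction */
#define NANGLES 4          /* rows, diagonals, columns, anti-diagonals */

/* Unfiltered backprojection (hypothetical helper): smear every ray sum
 * back over the pixels on that ray, then divide by NANGLES * N. */
void backproject(const double rows[N], const double cols[N],
                 const double diag[NDIAG], const double anti[NDIAG],
                 double img[N][N])
{
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            /* Pixel (r, c) lies on exactly one ray per direction. */
            double sum = rows[r] + cols[c]
                       + diag[c - r + (N - 1)] + anti[r + c];
            img[r][c] = sum / (NANGLES * N);
        }
    }
}
```

With row sums 5, 4, 6, column sums 6, 6, 3, diagonal sums 4, 2, 4, 4, 1, and anti-diagonal sums 1, 4, 7, 2, 1, the center pixel becomes 21/12 = 1.75 and the bottom-left pixel 23/12 ≈ 1.92.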

Note On Filtering

[Figure: backprojection with no filtering vs. with filtering]

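Filtered Backprojection (Step 1)

Start from an all-zero image:

0 0 0
0 0 0
0 0 0

Filtered projections:
Rows (top to bottom): 1.22, -0.73, 1.61
Columns (left to right): 1.16, 0.49, 0
Diagonals (bottom-left corner to top-right corner): 1.06, -0.84, 0.39, 0.61, -1.22
Anti-diagonals (top-left corner to bottom-right corner): -0.55, -0.06, 1.55, -0.84, -0.11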

Filtered Backprojection (Step 2)

After smearing the filtered row projections:

1.22 1.22 1.22
-0.73 -0.73 -0.73
1.61 1.61 1.61

Filtered Backprojection (Step 3)

After adding the filtered diagonal projections:

1.61 1.83 0
-1.57 -0.34 -0.12
2.67 0.77 2

Filtered Backprojection (Step 4)

After adding the filtered column projections:

0.45 2.32 0
-0.41 0.15 -0.12
3.83 1.26 2

Filtered Backprojection (Step 5)

After adding the filtered anti-diagonal projections:

-0.1 2.26 1.55
-0.47 1.7 -0.96
5.38 0.42 1.89

Filtered Backprojection vs. Original

Filtered backprojection:

-0.1 2.26 1.55
-0.47 1.7 -0.96
5.38 0.42 1.89

Original:

1 3 1
1 2 1
4 1 1

Conventional Algebraic Reconstruction Methods

Problem Formulation

We want to formulate reconstruction as a linear inverse problem:

$$Ax = b$$

where x is a column vector of length N² representing the pixels of the original image, A is an M × N² matrix representing the data acquisition process, and b is a column vector of length M representing the measured projection data.

We want to find a solution such that:

$$x = A_{\mathrm{left}}^{-1}\, b$$

Notes on the Discretized Image x

The discretized image is denoted by:

$$X \in \mathbb{R}^{N \times N}$$

and by:

$$x = \mathrm{vec}(X) \in \mathbb{R}^{N^2 \times 1}$$

where x is obtained by stacking the columns of X.

Notes on the projection data b

There are a total of d detectors and θ projection angles, so that a total of M = d·θ measurements are used.

Then the measured projection data is denoted by:

$$B \in \mathbb{R}^{d \times \theta}$$

and by:

$$b = \mathrm{vec}(B) \in \mathbb{R}^{d\theta \times 1} = \mathbb{R}^{M \times 1}$$

where b is obtained by stacking the columns of B.

Notes on the Acquisition Matrix A

The acquisition of projection data b from x is modeled by:

$$b_i = \sum_{j=1}^{N^2} a_{i,j}\, x_j, \qquad i = 1, 2, \ldots, M$$

where:

$$A \in \mathbb{R}^{M \times N^2}$$

and a_{i,j} is the contribution of pixel j to projection i.

Also, let:

$$A_i = \left(A_{i,:}\right)^T$$

be a column vector that represents the ith ray, which computes the value of the ith projection.

Iterative Reconstruction Algorithm

Let x^(k) denote the kth estimate of the reconstruction.

Then:

$$x^{(k+1)} = x^{(k)} + \lambda A^T\left(b - A x^{(k)}\right)$$

where the relaxation factor λ is a scalar.
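A single step of this update can be sketched in C as below. The function name `landweber_step`, the dense row-major storage of A, and the parameter names are assumptions made for illustration; a practical implementation would exploit the sparsity of A.

```c
#include <stdlib.h>

/* One update x <- x + lambda * A^T (b - A x) (hypothetical helper).
 * A is M x n, dense, row-major. Returns 0 on success, -1 if the
 * temporary residual buffer cannot be allocated. */
int landweber_step(const double *A, const double *b, double *x,
                   int M, int n, double lambda)
{
    double *res = malloc((size_t)M * sizeof *res);
    if (!res)
        return -1;

    /* Residual of every ray: res = b - A x. */
    for (int i = 0; i < M; i++) {
        double ax = 0.0;
        for (int j = 0; j < n; j++)
            ax += A[i * n + j] * x[j];
        res[i] = b[i] - ax;
    }

    /* Backproject the residual: x += lambda * A^T res. */
    for (int j = 0; j < n; j++) {
        double g = 0.0;
        for (int i = 0; i < M; i++)
            g += A[i * n + j] * res[i];
        x[j] += lambda * g;
    }

    free(res);
    return 0;
}
```

Calling the function K times in a loop gives K iterations. Whether the iterates converge depends on the choice of λ relative to the spectrum of A^T A.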

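Proof of Convergence [1]

Let

$$\Delta = I - \lambda A^T A$$

Then

$$\begin{aligned}
x^{(k+1)} &= \Delta x^{(k)} + \lambda A^T b \\
&= \Delta^{2} x^{(k-1)} + \left(\Delta + I\right)\lambda A^T b \\
&= \Delta^{k+1} x^{(0)} + \left(\Delta^{k} + \cdots + \Delta + I\right)\lambda A^T b
\end{aligned}$$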

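Proof of Convergence [2]

If A^T A is positive definite and λ is chosen so that the spectral radius of Δ is less than 1, then:

$$\lim_{k \to \infty} \Delta^{k+1} = 0$$

and

$$\lim_{k \to \infty} \left(\Delta^{k} + \cdots + \Delta + I\right) = \left(I - \Delta\right)^{-1}$$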

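Proof of Convergence [3]

Therefore:

$$\begin{aligned}
\lim_{k \to \infty} x^{(k)} &= \left(I - \Delta\right)^{-1} \lambda A^T b \\
&= \left(A^T A\right)^{-1} A^T b \\
&= A_{\mathrm{left}}^{-1}\, b
\end{aligned}$$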

# of Projections needed for ART

Reconstruction on a square grid (N×N) with N detectors
Assuming a circular reconstruction region, we can ignore pixels outside this region:

$$\#\text{ of pixels} \approx \frac{\pi}{4} N^2$$

$$N_{ART} \ge \frac{\#\text{ of pixels}}{\#\text{ of detectors}} = \frac{\pi N^2 / 4}{N} = \frac{\pi}{4} N$$

# of Projections needed for FBP [1]

Reconstructing region with diameter L

Sampling interval is at least:

$$\Delta\omega = \frac{1}{L}$$

with a maximum frequency of:

$$\omega_{\max} = \frac{N}{2} \cdot \frac{1}{L}$$

Due to polar sampling, the density of samples decreases as we go outward on the polar grid

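# of Projections needed for FBP [2]

To ensure a sampling rate of at least Δω everywhere:

$$\omega_{\max} \cdot \frac{\pi}{N_{FBP}} \le \Delta\omega \quad\Longrightarrow\quad \frac{N}{2L} \cdot \frac{\pi}{N_{FBP}} \le \frac{1}{L}$$

therefore:

$$N_{FBP} \ge \frac{\pi}{2} N$$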
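As a worked example of the two projection-count estimates, take N = 512 detectors (a value chosen here only for illustration). Using the bounds N_ART ≥ πN/4 and N_FBP ≥ πN/2:

```latex
\begin{aligned}
N_{ART} &\ge \frac{\pi}{4} \cdot 512 \approx 402 \\
N_{FBP} &\ge \frac{\pi}{2} \cdot 512 \approx 804
\end{aligned}
```

Under these estimates, the algebraic methods need roughly half as many projection angles as filtered backprojection for the same grid.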

Matrix Formulation with Normalization

Introduce diagonal matrices V and W:

$$x^{(k+1)} = x^{(k)} + \lambda\, W A^T V \left(b - A x^{(k)}\right)$$

V: diagonal matrix of the inverse of the row sums

$$V_{i,i} = \frac{1}{\sum_{j=1}^{N^2} a_{i,j}}$$

W: diagonal matrix of the inverse of the column sums

$$W_{j,j} = \frac{1}{\sum_{i=1}^{M} a_{i,j}}$$
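The normalized update can be sketched in C by folding V and W into the two loops instead of storing them as matrices. The function name `sirt_step`, the dense row-major storage of A, and the handling of all-zero rows or columns (they are skipped) are assumptions of this sketch.

```c
#include <stdlib.h>

/* One update x <- x + lambda * W A^T V (b - A x) (hypothetical helper),
 * with V = diag(1 / row sums of A) and W = diag(1 / column sums of A).
 * A is M x n, dense, row-major, with nonnegative entries.
 * Returns 0 on success, -1 on allocation failure. */
int sirt_step(const double *A, const double *b, double *x,
              int M, int n, double lambda)
{
    double *vres = malloc((size_t)M * sizeof *vres);
    if (!vres)
        return -1;

    /* vres = V (b - A x): residual of each ray divided by its row sum. */
    for (int i = 0; i < M; i++) {
        double ax = 0.0, rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            ax += A[i * n + j] * x[j];
            rowsum += A[i * n + j];
        }
        vres[i] = (rowsum > 0.0) ? (b[i] - ax) / rowsum : 0.0;
    }

    /* x += lambda * W A^T vres: backproject, divide by the column sum. */
    for (int j = 0; j < n; j++) {
        double back = 0.0, colsum = 0.0;
        for (int i = 0; i < M; i++) {
            back += A[i * n + j] * vres[i];
            colsum += A[i * n + j];
        }
        if (colsum > 0.0)
            x[j] += lambda * back / colsum;
    }

    free(vres);
    return 0;
}
```

Recomputing the row and column sums on every call keeps the sketch short; since they depend only on A, they could be computed once and reused across iterations.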

Reconstruction Methods

Algebraic Reconstruction Technique (ART)
– Update image after each ray is processed
Simultaneous Iterative Reconstruction Technique (SIRT)
– Update image after all rays are processed
Simultaneous Algebraic Reconstruction Technique (SART)
– Update image after all rays in a single projection angle are processed

ART

Image update method:
– After each ray is processed

Pseudocode:

for k = 1:K
    for i = 1:M
        $x^{(i+1)} = x^{(i)} + \lambda\, W_i A_i V_i \left(b_i - A_i^T x^{(i)}\right)$
    end
    $x^{(k+1)} = x^{(M+1)}$
end

ART (Iterations 1-6)

Iteration 1:

1 3 0.83
1 1.83 0.75
4.33 1.25 1

Iteration 2:

1 3.03 1.06
0.97 2 1.03
3.94 0.97 1

Iteration 3:

1 2.99 0.98
1.01 2 0.99
4.02 1.01 1

Iteration 4:

1 3 1
1 2 1
4 1 1

Iteration 5:

1 3 1
1 2 1
4 1 1

Iteration 6:

1 3 1
1 2 1
4 1 1

SIRT

Image update method:
– After all rays are processed

Pseudocode:

for k = 1:K
    $x^{(k+1)} = x^{(k)} + \lambda\, W A^T V \left(b - A x^{(k)}\right)$
end

SIRT (Iterations 1-6, λ = 0.5)

Reconstructed images:

0.67 3.5 0.66
0.83 2.17 0.33
5.83 0.67 0.33

0.78 3.43 0.86
0.76 2.08 1.01
4.28 0.85 0.94

0.94 3.2 0.91
0.87 2.04 0.99
4.12 0.91 1.02

0.97 3.1 0.95
0.94 2.02 1
4.05 0.96 1.01

0.99 3.05 0.97
0.97 2.01 1
4.03 0.98 1.01

0.99 3.03 0.99
0.98 2.01 1
4.01 0.99 1

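SART

Image update method:
– After all rays in a single projection angle are processed

Pseudocode:

for k = 1:K
    for θ = 1:Θ
        $x^{(\theta+1)} = x^{(\theta)} + \lambda\, W_\theta A_\theta^T V_\theta \left(b_\theta - A_\theta x^{(\theta)}\right)$
    end
    $x^{(k+1)} = x^{(\Theta+1)}$
end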

SART (Step 1, Iteration 1, Theta 1)

Current estimate:

1.67 1.67 1.67
1.67 1.67 1.67
1.67 1.67 1.67

Forward projection along the rows: 5, 5, 5

SART (Step 2, Iteration 1, Theta 1)

Corrections per row: (5 - 5)/3 = 0, (4 - 5)/3 = -0.33, (6 - 5)/3 = 0.33

Updated estimate:

1.67 1.67 1.67
1.33 1.33 1.33
2 2 2

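SART (Step 1, Iteration 1, Theta 2)

Current estimate:

1.67 1.67 1.67
1.33 1.33 1.33
2 2 2

Forward projection along the diagonals (bottom-left corner to top-right corner): 2, 3.33, 5, 3, 1.67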

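SART (Step 2, Iteration 1, Theta 2)

Corrections per diagonal: (4 - 2)/1 = 2, (2 - 3.33)/2 = -0.67, (4 - 5)/3 = -0.33, (4 - 3)/2 = 0.5, (1 - 1.67)/1 = -0.67

Updated estimate:

1.33 2.17 1
0.67 1 1.83
4 1.33 1.67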

SART (Step 1, Iteration 1, Theta 3)

Current estimate:

1.33 2.17 1
0.67 1 1.83
4 1.33 1.67

Forward projection along the columns: 6, 4.5, 4.5

SART (Step 2, Iteration 1, Theta 3)

Corrections per column: (6 - 6)/3 = 0, (6 - 4.5)/3 = 0.5, (3 - 4.5)/3 = -0.5

Updated estimate:

1.33 2.67 0.5
0.67 1.5 1.33
4 1.83 1.17

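SART (Step 1, Iteration 1, Theta 4)

Current estimate:

1.33 2.67 0.5
0.67 1.5 1.33
4 1.83 1.17

Forward projection along the anti-diagonals (top-left corner to bottom-right corner): 1.33, 3.33, 6, 3.17, 1.17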

SART (Step 2, Iteration 1, Theta 4)

Corrections per anti-diagonal: (1 - 1.33)/1 = -0.33, (4 - 3.33)/2 = 0.33, (7 - 6)/3 = 0.33, (2 - 3.17)/2 = -0.59, (1 - 1.17)/1 = -0.17

Updated estimate:

1 3 0.83
1 1.83 0.75
4.33 1.25 1
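The step-by-step SART walk-through can be sketched in C as follows. This is a minimal illustration for the 3×3 example only: the names `ray_index` and `sart_iteration`, the angle ordering (rows, diagonals, columns, anti-diagonals), and the binary ray weights are assumptions. Because each pixel lies on exactly one ray per angle, the column-sum normalization within an angle is 1 and only the division by the ray length remains.

```c
#define N 3                /* image is N x N */
#define NDIAG (2 * N - 1)  /* largest number of rays per angle */

/* Index of the ray through pixel (r, c) for each of the four angles. */
static int ray_index(int angle, int r, int c)
{
    switch (angle) {
    case 0:  return r;                /* horizontal rays (rows) */
    case 1:  return c - r + (N - 1);  /* diagonal rays */
    case 2:  return c;                /* vertical rays (columns) */
    default: return r + c;            /* anti-diagonal rays */
    }
}

/* One SART iteration (hypothetical helper). meas[angle][ray] holds the
 * measured ray sums; unused entries are ignored. */
void sart_iteration(double img[N][N], double meas[4][NDIAG], double lambda)
{
    for (int angle = 0; angle < 4; angle++) {
        double fwd[NDIAG] = {0.0};
        int count[NDIAG] = {0};

        /* Step 1: forward project the current estimate. */
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++) {
                int i = ray_index(angle, r, c);
                fwd[i] += img[r][c];
                count[i]++;
            }

        /* Step 2: spread each ray's residual evenly over its pixels. */
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++) {
                int i = ray_index(angle, r, c);
                img[r][c] += lambda * (meas[angle][i] - fwd[i]) / count[i];
            }
    }
}
```

Starting from a uniform image of 15/9 ≈ 1.67 with λ = 1, one call reproduces the estimate 1 3 0.83 / 1 1.83 0.75 / 4.33 1.25 1.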

SART (Iterations 1-6)

Iteration 1:

1 3 0.83
1 1.83 0.75
4.33 1.25 1

Iteration 2:

1 3.03 1.06
0.97 2 1.03
3.94 0.97 1

Iteration 3:

1 2.99 0.98
1.01 2 0.99
4.02 1.01 1

Iteration 4:

1 3 1
1 2 1
4 1 1

Iteration 5:

1 3 1
1 2 1
4 1 1

Iteration 6:

1 3 1
1 2 1
4 1 1

Modeling Performance (CPU, GPU, FPGA)

Write C pseudo code for Matrix-Vector multiplication and Vector-Vector addition
Convert C pseudo code to application-specific pseudo code (CPU = x86, GPU = OpenCL/CUDA)
Model latency and throughput of pseudo code given:
– CPU architecture:
  o Cache structure, frequency, total # of threads, etc.
– Image reconstruction problem:
  o N, d, θ, A matrix sparsity (α), # of iterations, etc.

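C Pseudo Code (Ax = b)

    /* b = A*x; A has M rows and N columns (N is the length of x), stored row by row */
    float btemp;
    float *Apos = &A[0][0];           /* walks through A in row-major order */
    for (int i = 0; i < M; i++)
    {
        float *xpos = &x[0];          /* restart at the first element of x for every row */
        btemp = 0;
        for (int j = 0; j < N; j++)
        {
            btemp += (*Apos++) * (*xpos++);
        }
        b[i] = btemp;
    }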

x86 Pseudo Code (Ax = b)

loop_i:
    mov   ecx, edi              ; xpos = &x[0] (edi assumed to hold the address of x)
    fldz                        ; btemp = 0
    mov   eax, hXXXX            ; j = N
loop_j:
    fld   dword ptr [edx]       ; A_ij
    add   edx, 4                ; Apos++
    fmul  dword ptr [ecx]       ; A_ij*x_j
    add   ecx, 4                ; xpos++
    faddp st(1), st             ; btemp = btemp + A_ij*x_j
    dec   eax                   ; j--
    jnz   short loop_j          ; loop if j~=0
    fstp  dword ptr [ebx]       ; b_i = btemp (pop so the FPU stack does not grow)
    add   ebx, 4                ; bpos++
    dec   esi                   ; i--
    jnz   short loop_i          ; loop if i~=0

Results for CPUs [1]

Processor                    Xeon E5405 [1]   Xeon E5405 [1]   Xeon E5405 [1]   Xeon E5405 [1]
Architecture                 Penryn           Penryn           Penryn           Penryn
Operating Frequency          2.00 GHz         2.00 GHz         2.00 GHz         2.00 GHz
Number of Cores              4                4                4                4
Number of Threads per Core   1                1                1                1
Total Threads Used           1                1                1                1

Reconstruction Specifics
Number of Pixels (NxN)       1024x1024        1024x1024        512x512          512x512
Number of Detectors (D)      1024             1024             512              512
Number of Angles (θ)         140              140              140              140
Matrix Sparsity (α)          0.098%           0.098%           0.195%           0.195%
Number of Iterations         30               30               30               30
Loop Unrolling               Yes              Yes              Yes              Yes
SIMD or Floating Point?      Floating Point   SIMD             Floating Point   SIMD

Reconstruction Time
Reported [s]                 24.174           6.639            6.087            1.650
Estimated [s]                22.478           6.307            5.613            1.570
Accuracy [%]                 92.982%          94.987%          92.214%          95.180%

[1] J.I. Agulleiro, E.M. Garzon, I. Garcia, J.J. Fernandez, "Multi-core Desktop Processors Make Possible Real-Time Electron Tomography," 2011 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp.127-132, Feb. 2011.

Results for CPUs [2]

Processor                    Xeon 3.4 [2]     Xeon 3.4 [2]     Xeon 3.4 [2]
Architecture                 NetBurst         NetBurst         NetBurst
Operating Frequency          3.40 GHz         3.40 GHz         3.40 GHz
Number of Cores              1                1                1
Number of Threads per Core   1                1                1
Total Threads Used           1                1                1

Reconstruction Specifics
Number of Pixels (NxN)       2048x2048        1024x1024        512x512
Number of Detectors (D)      2048             1024             512
Number of Angles (θ)         88               88               88
Matrix Sparsity (α)          0.049%           0.195%           0.977%
Number of Iterations         10               10               10
Loop Unrolling               Yes              Yes              Yes
SIMD or Floating Point?      Floating Point   Floating Point   Floating Point

Reconstruction Time
Reported [s]                 4.512            2.227            1.336
Estimated [s]                5.488            2.558            1.509
Accuracy [%]                 121.630%         114.875%         112.953%

[2] D.C. Diez, H. Mueller, A.S. Frangakis, "Implementation and Performance Evaluation of Reconstruction Algorithms on Graphics Processors," Journal of Structural Biology, vol. 157, no. 1, pp. 288-295, Jan. 2007.

Results for CPUs [3]

Processor                    P4 2.40A [2]     P4 2.40A [2]     P4 2.40A [2]
Architecture                 Prescott         Prescott         Prescott
Operating Frequency          2.40 GHz         2.40 GHz         2.40 GHz
Number of Cores              1                1                1
Number of Threads per Core   2                2                2
Total Threads Used           2                2                2

Reconstruction Specifics
Number of Pixels (NxN)       2048x2048        1024x1024        512x512
Number of Detectors (D)      2048             1024             512
Number of Angles (θ)         88               88               88
Matrix Sparsity (α)          0.049%           0.195%           0.977%
Number of Iterations         10               10               10
Loop Unrolling               Yes              Yes              Yes
SIMD or Floating Point?      Floating Point   Floating Point   Floating Point


[2] D.C. Diez, H. Mueller, A.S. Frangakis, "Implementation and Performance Evaluation of Reconstruction Algorithms on Graphics Processors," Journal of Structural Biology, vol. 157, no. 1, pp. 288-295, Jan. 2007.

Results for CPUs [4]

Processor                    2x X5550 [3]     4x X7460 [3]     4x X7560 [3]
Architecture                 Nehalem          Core             Nehalem
Operating Frequency          2.66 GHz         2.66 GHz         2.27 GHz
Number of Cores              4                6                8
Number of Threads per Core   2                1                2
Total Threads Used           16               24               64

Reconstruction Specifics
Number of Pixels (NxN)       512x512          512x512          512x512
Number of Detectors (D)      512              512              512
Number of Angles (θ)         414              414              414
Matrix Sparsity (α)          0.391%           0.391%           0.391%
Number of Iterations         1                1                1
Loop Unrolling               No               No               No
SIMD or Floating Point?      Floating Point   Floating Point   Floating Point


[3] H.G. Hofmann, B. Keck, C. Rohkohl, J. Hornegger, "Comparing Performance of Many-core CPUs and GPUs for Static and Motion Compensated Reconstruction of C-arm CT Data," Medical Physics, vol. 38, no 1, pp. 468-473, Jan. 2011.

Future Work

Modeling performance of SART on GPUs and FPGAs
