FPGA Accelerated 3-D Tomography


Richard Dorrance

Progress Update: 09/07/12

Outline

Introduction to Tomography

Reconstruction Methods
– Analytical
  o Backprojection
  o Filtered Backprojection
– Algebraic
  o Algebraic Reconstruction Technique (ART)
  o Simultaneous Iterative Reconstruction Technique (SIRT)
  o Simultaneous Algebraic Reconstruction Technique (SART)

Modeling Performance of Reconstruction Methods

Future Work

Tomography

Cross-sectional imaging technique using transmission or reflection data from multiple angles

Basis for CAT scan, MRI, PET, SPECT, ET, etc.

Computed Tomography (CT): a form of tomographic reconstruction on computers

Cross-Sections by X-Ray Projections

Project X-ray through biological tissue; measure total absorption of ray by tissue

Projection $P_\theta(t)$ is the Radon transform of the object function $f(x,y)$:

$$P_\theta(t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy$$

The total set of projections is called the sinogram

Shepp-Logan Phantom

Standard test image for tomographic reconstructions

Example Image with Projections

[Figure: example image with its projections; extracted labels: 46 6 312]

CT Reconstruction

Restore image from projection data

Inverse Radon transform

Most common algorithm is filtered backprojection
– “Smear” each projection over image plane

Accuracy of reconstruction depends on the number of detectors and projection angles

[Figure: reconstructions of the original image from 4, 16, 64, and 256 angles]

Analytical Reconstruction Methods

(Filtered) Backprojection Pseudo Code:
– Input: sinogram sino(θ, N)
– Output: image img(x,y)

for each θ
    filter sino(θ,:)                          ; only for FBP
    for each x
        for each y
            n = round(x*cos(θ) + y*sin(θ))    ; detector index hit by pixel (x,y)
            img(x,y) = img(x,y) + sino(θ,n)
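The pseudo code above can be sketched in C roughly as follows. This is a minimal illustration, not the implementation behind the slides: the names `backproject` and `normalize`, the precomputed `cos_t`/`sin_t` tables, the centring of pixel and detector coordinates, and the nearest-bin lookup are all assumptions made for the example. It covers only the unfiltered case; the final normalization by θ·N is included as a separate helper.

```c
#include <stddef.h>

/* Unfiltered backprojection onto an n-by-n image (row-major, zeroed by the
 * caller). sino holds n_angles rows of n_det samples. cos_t/sin_t hold the
 * cosine and sine of each projection angle, precomputed by the caller
 * (a hypothetical interface; lookup tables also map naturally onto hardware). */
void backproject(const float *sino, const float *cos_t, const float *sin_t,
                 size_t n_angles, size_t n_det, float *img, size_t n)
{
    const float c_img = (float)(n - 1) / 2.0f;      /* image centre    */
    const float c_det = (float)(n_det - 1) / 2.0f;  /* detector centre */

    for (size_t a = 0; a < n_angles; a++) {
        for (size_t y = 0; y < n; y++) {
            for (size_t x = 0; x < n; x++) {
                /* Detector coordinate hit by pixel (x, y), plus 0.5 so that
                 * the truncation below rounds to the nearest bin. */
                float u = ((float)x - c_img) * cos_t[a]
                        + ((float)y - c_img) * sin_t[a] + c_det + 0.5f;
                if (u < 0.0f || u >= (float)n_det)
                    continue;               /* ray falls off the detector */
                img[y * n + x] += sino[a * n_det + (size_t)u];
            }
        }
    }
}

/* Final step: divide each pixel by (number of angles * number of detectors). */
void normalize(float *img, size_t n, size_t n_angles, size_t n_det)
{
    const float scale = 1.0f / (float)(n_angles * n_det);
    for (size_t i = 0; i < n * n; i++)
        img[i] *= scale;
}
```

With two angles (0° and 90°) the sinogram rows act as column sums and row sums, so each pixel simply accumulates the two sums that pass through it.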

Backprojection (Step 1)

[Figure: running image after successive backprojection steps; projection labels as in the example image]

10 8 10

15 15 9
12 14 11
16 14 13

16 19 16
16 21 13
23 16 14

Backprojection vs. Original

Final Step: normalize image power
– Divide each pixel by θ·N (number of angles times number of detectors; 4·3 = 12 in this example)

1.33 1.58 1.33
1.33 1.75 1.08
1.92 1.33 1.17

Note On Filtering

[Figure: reconstruction with no filtering vs. with filtering]

Filtered Backprojection (Step 1)

[Figure: running image after successive filtered backprojection steps. Filtered projection values shown around the image:]

-1.22 0.61
0.39 -0.84
1.06 1.16 0.49 0 -0.11 -0.84
1.55 -0.06

1.22 1.22 1.22
-0.73 -0.73 -0.73
1.61 1.61 1.61

1.61 1.83 0
-1.57 -0.34 -0.12
2.67 0.77 2

0.45 2.32 0
-0.41 0.15 -0.12
3.83 1.26 2

-0.1 2.26 1.55
-0.47 1.7 -0.96
5.38 0.42 1.89

Filtered Backprojection vs. Original

-0.1 2.26 1.55

-0.47 1.7 -0.96

5.38 0.42 1.89

Conventional Algebraic Reconstruction Methods

Problem Formulation

We want to formulate it as a Linear Inverse Problem:

$$Ax = b$$

where $x$ is a column vector of length $N^2$ representing the pixels of the original image, $A$ is an $M \times N^2$ matrix representing the data acquisition process, and $b$ is a column vector of length $M$ representing the measured projection data.

We want to find a solution such that:

$$x = A_{\mathrm{left}}^{-1}\, b$$

Notes on the Discretized Image x

The discretized image is denoted by:

$$X \in \mathbb{R}^{N \times N}$$

and by:

$$x = \mathrm{vec}(X) \in \mathbb{R}^{N^2}$$

where $x$ is obtained by stacking the columns of $X$.

Notes on the Projection Data b

There are a total of $d$ detectors and $\theta$ projection angles, so that a total of $M = d \cdot \theta$ measurements are used.

Then the measured projection data is denoted by:

$$B \in \mathbb{R}^{d \times \theta}$$

and by:

$$b = \mathrm{vec}(B) \in \mathbb{R}^{M}$$

where $b$ is obtained by stacking the columns of $B$.

Notes on the Acquisition Matrix A

The acquisition of projection data $b$ from $x$ is modeled as:

$$b_i = \sum_{j=1}^{N^2} a_{i,j}\, x_j, \qquad i = 1, 2, \ldots, M$$

where $a_{i,j}$ is the contribution of pixel $j$ to projection $i$.

Also, let:

$$A_i = (A_{i,:})^T$$

be a column vector that represents the $i$th ray, which computes the value of the $i$th projection.

Iterative Reconstruction Algorithm

Let $x^{(k)}$ denote the $k$th estimate of the reconstruction. Each iteration updates it by:

$$x^{(k+1)} = x^{(k)} + \lambda A^T \left(b - A x^{(k)}\right)$$

where the relaxation factor $\lambda$ is a scalar.
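One iteration of this update, $x \leftarrow x + \lambda A^T (b - Ax)$, might look like the following in C. It is a sketch under stated assumptions: `A` is stored dense and row-major, `r` is a caller-supplied scratch vector, and the function name `landweber_step` is hypothetical (the update is commonly known as a Landweber iteration).

```c
#include <stddef.h>

/* One relaxed update x <- x + lambda * A^T (b - A x).
 * A is m-by-n, row-major; r is caller-provided scratch of length m. */
void landweber_step(const float *A, const float *b, float *x,
                    float *r, size_t m, size_t n, float lambda)
{
    /* Residual r = b - A x. */
    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        r[i] = b[i] - acc;
    }
    /* Correction x += lambda * A^T r. */
    for (size_t j = 0; j < n; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < m; i++)
            acc += A[i * n + j] * r[i];
        x[j] += lambda * acc;
    }
}
```

Both loops are matrix-vector products, which is why the performance modeling later in the deck concentrates on that kernel.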

Proof of Convergence [1] Let

Proof of Convergence [2]

If $A^T A$ is positive definite and $\lambda$ is chosen so that the spectral radius of $\Delta$ is less than 1, then:

$$\lim_{k \to \infty} \Delta^{k+1} = 0$$

Proof of Convergence [3] Therefore:
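A sketch of how this argument is usually completed, assuming the iteration matrix is $\Delta = I - \lambda A^T A$ (this definition is an assumption here, chosen to match the update $x^{(k+1)} = x^{(k)} + \lambda A^T (b - A x^{(k)})$):

```latex
\begin{align*}
x^{(k+1)} &= x^{(k)} + \lambda A^T\bigl(b - A x^{(k)}\bigr)
           = \Delta\, x^{(k)} + \lambda A^T b,
           \qquad \Delta = I - \lambda A^T A \\
x^{(k+1)} &= \Delta^{k+1} x^{(0)} + \lambda \sum_{j=0}^{k} \Delta^{j} A^T b \\
\lim_{k\to\infty} x^{(k)} &= \lambda\,(I - \Delta)^{-1} A^T b
           = (A^T A)^{-1} A^T b
\end{align*}
```

Under these assumptions the iteration converges to the least-squares solution regardless of the starting estimate $x^{(0)}$, because the $\Delta^{k+1} x^{(0)}$ term vanishes and the geometric series sums to $(I - \Delta)^{-1} = (\lambda A^T A)^{-1}$.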

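# of Projections needed for ART

Reconstruction on a square grid ($N \times N$) with $N$ detectors

Assuming a circular reconstruction region, we can ignore pixels outside this region:

$$\#\text{ of pixels} \approx \frac{\pi}{4} N^2$$

$$\#\text{ of projections} \approx \frac{\#\text{ of pixels}}{\#\text{ of detectors}} = \frac{\pi}{4} N$$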
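As a worked instance of this count ($N = 512$ is chosen here only for illustration): a circle inscribed in the $N \times N$ grid covers about $\pi/4$ of its pixels, and each projection contributes $N$ measurements.

```latex
\[
\#\text{projections} \;\approx\; \frac{\#\text{pixels}}{\#\text{detectors}}
  = \frac{\tfrac{\pi}{4} N^2}{N} = \frac{\pi}{4} N,
\qquad N = 512 \;\Rightarrow\; \frac{\pi}{4}\cdot 512 \approx 402 .
\]
```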

# of Projections needed for FBP [1] Reconstructing region with diameter L

Sampling interval is at least:

with a maximum frequency of:

Due to polar sampling, the density of samples decreases as we go outward on the polar grid

# of Projections needed for FBP [2] To ensure a sampling rate of at least Δω everywhere:

therefore:

Matrix Formulation with Normalization

Introduce diagonal matrices $V$ and $W$:

$V$: diagonal matrix of the inverse of the row sums of $A$

$W$: diagonal matrix of the inverse of the column sums of $A$

$$x^{(k+1)} = x^{(k)} + \lambda\, W A^T V \left(b - A x^{(k)}\right)$$

Reconstruction Methods

Algebraic Reconstruction Technique
– Update image after each ray is processed

Simultaneous Iterative Reconstruction Technique
– Update image after all rays are processed

Simultaneous Algebraic Reconstruction Technique
– Update image after all rays in a single projection angle are processed

ART

Image update method:
– After each ray is processed

Pseudocode:

for k = 1:K
    $x_0 = x^{(k)}$
    for i = 1:M
        $x_i = x_{i-1} + \lambda\, W A_i V_{i,i} \left(b_i - A_i^T x_{i-1}\right)$
    $x^{(k+1)} = x_M$
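A per-ray update of this kind can be sketched in C as below. Note the swap: this example uses the classical Kaczmarz normalization (dividing by the squared norm of the ray's row) rather than the V and W matrices from the slides, so it illustrates the "update after each ray" idea, not the exact formula. The name `art_sweep` and the dense row-major storage of `A` are assumptions.

```c
#include <stddef.h>

/* One ART sweep: visit every ray i and immediately correct x so that it
 * better satisfies row i of A x = b. A is m-by-n, row-major. */
void art_sweep(const float *A, const float *b, float *x,
               size_t m, size_t n, float lambda)
{
    for (size_t i = 0; i < m; i++) {
        const float *row = &A[i * n];
        float dot = 0.0f, norm2 = 0.0f;
        for (size_t j = 0; j < n; j++) {
            dot   += row[j] * x[j];     /* current projection of ray i */
            norm2 += row[j] * row[j];   /* squared row norm            */
        }
        if (norm2 == 0.0f)
            continue;                   /* ray misses the image */
        float step = lambda * (b[i] - dot) / norm2;
        for (size_t j = 0; j < n; j++)
            x[j] += step * row[j];      /* update before the next ray */
    }
}
```

Because each ray sees the corrections made by the rays before it, the inner loop carries a dependency from one ray to the next, which is the main obstacle to parallelizing ART compared with SIRT.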

ART (Iterations 1-6)

1 3.03 1.06

0.97 2 1.03

3.94 0.97 1

1 2.99 0.98

1.01 2 0.99

4.02 1.01 1

Iteration 4 Iteration 5 Iteration 6

1 3 0.83

1 1.83 0.75

4.33 1.25 1

SIRT

Image update method:
– After all rays are processed

Pseudocode:

for k = 1:K
    $x^{(k+1)} = x^{(k)} + \lambda\, W A^T V \left(b - A x^{(k)}\right)$
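A single SIRT iteration can be sketched in C as follows, taking V as the inverse row sums of A (one per ray) and W as the inverse column sums of A (one per pixel). This reading of V and W, the dense row-major storage, the zero-sum guards, and the name `sirt_step` are assumptions made for the example.

```c
#include <stddef.h>

/* One SIRT iteration x <- x + lambda * W A^T V (b - A x), where
 * V = diag(1 / row sums of A) and W = diag(1 / column sums of A).
 * A is m-by-n, row-major, with non-negative entries; r is scratch of
 * length m. Zero sums are skipped so that empty rays, or pixels that no
 * ray touches, leave x unchanged. */
void sirt_step(const float *A, const float *b, float *x,
               float *r, size_t m, size_t n, float lambda)
{
    /* r = V (b - A x): residual of every ray, scaled by its row sum. */
    for (size_t i = 0; i < m; i++) {
        float ax = 0.0f, row_sum = 0.0f;
        for (size_t j = 0; j < n; j++) {
            ax      += A[i * n + j] * x[j];
            row_sum += A[i * n + j];
        }
        r[i] = (row_sum != 0.0f) ? (b[i] - ax) / row_sum : 0.0f;
    }
    /* x += lambda * W A^T r: all pixels are updated together. */
    for (size_t j = 0; j < n; j++) {
        float back = 0.0f, col_sum = 0.0f;
        for (size_t i = 0; i < m; i++) {
            back    += A[i * n + j] * r[i];
            col_sum += A[i * n + j];
        }
        if (col_sum != 0.0f)
            x[j] += lambda * back / col_sum;
    }
}
```

In practice the row and column sums would be computed once and reused across iterations; they are recomputed here only to keep the sketch short.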

SIRT (Iterations 1-6, λ = 0.5)

0.67 3.5 0.66

0.83 2.17 0.33

5.83 0.67 0.33

0.78 3.43 0.86

0.76 2.08 1.01

4.28 0.85 0.94

0.94 3.2 0.91

0.87 2.04 0.99

4.12 0.91 1.02

0.97 3.1 0.95

0.94 2.02 1

4.05 0.96 1.01

0.99 3.05 0.97

0.97 2.01 1

4.03 0.98 1.01

0.99 3.03 0.99

0.98 2.01 1

4.01 0.99 1

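SART

Image update method:
– After all rays in a single projection angle are processed

Pseudocode:

for k = 1:K
    $x_0 = x^{(k)}$
    for θ = 1:Θ
        $x_\theta = x_{\theta-1} + \lambda\, W A_\theta^T V_\theta \left(b_\theta - A_\theta x_{\theta-1}\right)$
    $x^{(k+1)} = x_\Theta$

where $A_\theta$, $V_\theta$, and $b_\theta$ are the parts of $A$, $V$, and $b$ that belong to projection angle θ.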

SART (Step 1, Iteration 1, Theta 1)

[Figure: image estimate after each SART sub-step of iteration 1; the residual labels around the image are not recoverable]

1.67 1.67 1.67
1.33 1.33 1.33
1.67 1.67 1.67
1.33 1.33 1.33

1.33 2.17 1
0.67 1 1.83
4 1.33 1.67

1.33 2.67 0.5
0.67 1.5 1.33
4 1.83 1.17

ART (Step 2, Iteration 1, Theta 4)

1 3 0.83

1 1.83 0.75

4.33 1.25 1

SART (Iterations 1-6)

1 3.03 1.06

0.97 2 1.03

3.94 0.97 1

1 2.99 0.98

1.01 2 0.99

4.02 1.01 1

1 3 0.83

1 1.83 0.75

4.33 1.25 1

Modeling Performance (CPU, GPU, FPGA)

Write C pseudo code for Matrix-Vector multiplication and Vector-Vector addition

Convert C pseudo code to application specific pseudo code (CPU = x86, GPU = OpenCL/CUDA)

Model latency and throughput of pseudo code given:
– CPU architecture:
  o Cache structure, freq., total # of threads, etc.
– Image reconstruction problem:
  o N, d, θ, A matrix sparsity (α), # of iterations, etc.

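C Pseudo Code (Ax = b)

/* Dense matrix-vector product b = A*x. A has M rows and N columns, stored
   row-major; N here is the column count of A (the number of pixels). */
float btemp;
float *Apos = &A[0][0];          /* walks A one element at a time */
for (int i = 0; i < M; i++) {
    float *xpos = &x[0];         /* restart at the top of x for each row */
    btemp = 0;
    for (int j = 0; j < N; j++) {
        btemp += (*Apos++) * (*xpos++);
    }
    b[i] = btemp;
}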
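Since the model treats the sparsity α of A as a parameter, a sparse counterpart of this loop is a natural companion. The sketch below uses the compressed sparse row (CSR) layout, which is an assumption for illustration; the slides do not say how A is stored. The names `csr_matvec`, `row_ptr`, `col_idx`, and `val` are hypothetical.

```c
#include <stddef.h>

/* b = A x with A stored in CSR form: the non-zeros of row i are
 * val[row_ptr[i]] .. val[row_ptr[i + 1] - 1], and col_idx gives the
 * column of each stored value. m is the number of rows. */
void csr_matvec(const size_t *row_ptr, const size_t *col_idx,
                const float *val, const float *x, float *b, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        float btemp = 0.0f;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            btemp += val[k] * x[col_idx[k]];   /* only stored entries */
        b[i] = btemp;
    }
}
```

The number of multiply-adds drops from M·N to roughly α·M·N, at the cost of an indirect load of x through `col_idx`, which makes cache behaviour a larger part of any latency model.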

x86 Pseudo Code (Ax = b)

loop_i:                          ;
    fldz                         ; btemp = 0
    mov   ecx, edi               ; xpos = &x[0] (assumes edi holds the base address of x)
    mov   eax, hXXXX             ; j = N (column count; hXXXX is a placeholder)
loop_j:                          ;
    fld   dword ptr [edx]        ; A_ij
    add   edx, 4                 ; Apos++
    fmul  dword ptr [ecx]        ; A_ij*x_j
    add   ecx, 4                 ; xpos++
    faddp st(1), st              ; btemp = btemp + A_ij*x_j
    dec   eax                    ; j--
    jnz   short loop_j           ; loop if j~=0
    fstp  dword ptr [ebx]        ; b_i = btemp (pop so the FPU stack does not grow)
    add   ebx, 4                 ; bpos++
    dec   esi                    ; i--
    jnz   short loop_i           ; loop if i~=0

Results for CPUs [1]

Processor                           Xeon E5405 [1]   Xeon E5405 [1]   Xeon E5405 [1]   Xeon E5405 [1]
Architecture                        Penryn           Penryn           Penryn           Penryn
Operating Frequency                 2.00 GHz         2.00 GHz         2.00 GHz         2.00 GHz
Number of Cores                     4                4                4                4
Number of Threads per Core          1                1                1                1
Total Threads Used                  1                1                1                1

Reconstruction Specifics
Number of Pixels (NxN)              1024x1024        1024x1024        512x512          512x512
Number of Detectors (D)             1024             1024             512              512
Number of Angles (θ)                140              140              140              140
Matrix Sparsity (α)                 0.098%           0.098%           0.195%           0.195%
Number of Iterations                30               30               30               30
Loop Unrolling                      Yes              Yes              Yes              Yes
SIMD or Floating Point?             Floating Point   SIMD             Floating Point   SIMD

Reconstruction Time
Reported [s]                        24.174           6.639            6.087            1.650
Estimated [s]                       22.478           6.307            5.613            1.570
Accuracy (Estimated/Reported) [%]   92.982           94.987           92.214           95.180

[1] J.I. Agulleiro, E.M. Garzon, I. Garcia, J.J. Fernandez, "Multi-core Desktop Processors Make Possible Real-Time Electron Tomography," 2011 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp.127-132, Feb. 2011.

Results for CPUs [2]

Processor                           Xeon 3.4 [2]     Xeon 3.4 [2]     Xeon 3.4 [2]
Architecture                        NetBurst         NetBurst         NetBurst
Operating Frequency                 3.40 GHz         3.40 GHz         3.40 GHz
Number of Cores                     1                1                1
Number of Threads per Core          1                1                1
Total Threads Used                  1                1                1

Reconstruction Specifics
Number of Pixels (NxN)              2048x2048        1024x1024        512x512
Number of Detectors (D)             2048             1024             512
Number of Angles (θ)                88               88               88
Matrix Sparsity (α)                 0.049%           0.195%           0.977%
Number of Iterations                10               10               10
Loop Unrolling                      Yes              Yes              Yes
SIMD or Floating Point?             Floating Point   Floating Point   Floating Point

Reconstruction Time
Reported [s]                        4.512            2.227            1.336
Estimated [s]                       5.488            2.558            1.509
Accuracy (Estimated/Reported) [%]   121.630          114.875          112.953

[2] D.C. Diez, H. Mueller, A.S. Frangakis, "Implementation and Performance Evaluation of Reconstruction Algorithms on Graphics Processors," Journal of Structural Biology, vol. 157, no. 1, pp. 288-295, Jan. 2007.

Results for CPUs [3]

Processor                           P4 2.40A [2]     P4 2.40A [2]     P4 2.40A [2]
Architecture                        Prescott         Prescott         Prescott
Operating Frequency                 2.40 GHz         2.40 GHz         2.40 GHz
Number of Cores                     1                1                1
Number of Threads per Core          2                2                2
Total Threads Used                  2                2                2

Reconstruction Specifics
Number of Pixels (NxN)              2048x2048        1024x1024        512x512
Number of Detectors (D)             2048             1024             512
Number of Angles (θ)                88               88               88
Matrix Sparsity (α)                 0.049%           0.195%           0.977%
Number of Iterations                10               10               10
Loop Unrolling                      Yes              Yes              Yes
SIMD or Floating Point?             Floating Point   Floating Point   Floating Point


Results for CPUs [4]

Processor                           2x X5550 [3]     4x X7460 [3]     4x X7560 [3]
Architecture                        Nehalem          Core             Nehalem
Operating Frequency                 2.66 GHz         2.66 GHz         2.27 GHz
Number of Cores                     4                6                8
Number of Threads per Core          2                1                2
Total Threads Used                  16               24               64

Reconstruction Specifics
Number of Pixels (NxN)              512x512          512x512          512x512
Number of Detectors (D)             512              512              512
Number of Angles (θ)                414              414              414
Matrix Sparsity (α)                 0.391%           0.391%           0.391%
Number of Iterations                1                1                1
Loop Unrolling                      No               No               No
SIMD or Floating Point?             Floating Point   Floating Point   Floating Point

[3] H.G. Hofmann, B. Keck, C. Rohkohl, J. Hornegger, "Comparing Performance of Many-core CPUs and GPUs for Static and Motion Compensated Reconstruction of C-arm CT Data," Medical Physics, vol. 38, no. 1, pp. 468-473, Jan. 2011.

Future Work

Modeling performance of SART on GPUs and FPGAs
