Design and implementation of GPU-based SAR image processor

Najeeb Ahmad
Master Thesis Presentation

May, 2012

Supervisor: Dr. Sun Jinping

School of Electronic Information Engineering
Beihang University, Beijing, China

Contents
1. Introduction
2. GPU Computing
3. SAR Processing
4. Implementation
5. Conclusion & Future Work

1. Introduction
• Problem
• Motivation
• Objective
• Methodology

PROBLEM
Synthetic Aperture Radar (SAR) data processing is computationally intensive and time-consuming on conventional CPUs. Given the growing use of GPUs for scientific computing, the task is to accelerate the simplified range Doppler SAR processing algorithm on a GPU using modern GPGPU technology, to aim for real-time or near-real-time performance, and to evaluate the suitability of GPUs for SAR processing.

MOTIVATION
• Computationally intensive and time-consuming nature of SAR processing algorithms.
• Inherent parallelism in most SAR processing algorithms.
• Advent of modern GPGPU technology and availability of commodity GPUs as general purpose computation engines.
• Architectural parallelism and ample hardware resources in modern GPUs, which make them especially useful for handling large data quantities and implementing parallel SAR algorithms.

OBJECTIVE
To implement and accelerate the simplified range Doppler SAR processing algorithm on a modern NVIDIA TESLA GPU using CUDA and MATLAB-GPU capabilities.

The research explores areas such as:
• Algorithm adaptation for parallel implementation.
• Suitability of MATLAB for algorithm implementation.
• Suitability of CUDA for algorithm implementation.
• Comparison of CPU/CUDA/MATLAB-GPU implementations.
• GPU as a SAR processing platform.

METHODOLOGY
• Algorithm implementation and verification on an Intel Xeon CPU using MATLAB.
• Identification of the parallelizable portions of the algorithm.
• Algorithm implementation on the TESLA C1060 GPU using MATLAB's native GPU capabilities.
• Algorithm implementation on the TESLA C1060 GPU using CUDA.
• Analysis of the CPU, MATLAB-GPU and CUDA implementations.

2. GPU Computing
• Introduction to GPU Computing
• GPGPU: Brief History
• NVIDIA CUDA
• Writing efficient code

Introduction to GPU Computing
• Use of Graphics Processing Units (GPUs) for general purpose computing applications.
• CPU: single, four or eight cores. Capable of handling a few threads. Suitable for serial code.
• GPU: hundreds of cores. Capable of handling hundreds of threads. Suitable for parallel code.
• GPU computing model: heterogeneous computing model employing both CPU and GPU, with serial computing on the CPU and parallel computing on the GPU.

GPGPU: Brief History
• First use of GPUs as general purpose computing devices around 1999-2000, using graphics APIs. Large performance gains were observed, but the approach remained unpopular because programming was tedious.
• Introduction of NVIDIA's "CUDA" and AMD's "Stream Computing" in 2007 marked the beginning of the modern GPGPU era. Other vendors introduced their own GPGPU systems.
• NVIDIA's CUDA is gaining popularity due to its maturity and performance.

NVIDIA CUDA
• Compute Unified Device Architecture.
• Comprises an Instruction Set Architecture (ISA) and a parallel compute engine in the GPU, programmable with high level languages extended for GPU computing.
• The CUDA framework has two parts: hardware and software. From the software perspective, CUDA means C/C++ and FORTRAN extended to support GPU computing.
• CUDA is a "Single Instruction Multiple Thread" (SIMT) architecture.

CUDA Hardware
• Streaming multiprocessor (SM): basic computing unit of the GPU. Comprises eight streaming processors (SPs) and memory.
• GPUs differ in the number of SMs and in SP clock frequency.

[Figure: streaming multiprocessor block diagram (labels: SFU, Shared Memory)]

CUDA Memory Architecture
Understanding the memory architecture is critical for writing efficient CUDA programs.
All CUDA-enabled hardware has the following types of memory:
• Global memory.
• Shared memory and registers.
• Texture memory and texture cache.
• Constant memory and constant cache.
• Local memory for register spilling.

[Figure: CUDA memory architecture. Each SM (SM 1, SM 2, ...) contains SPs, shared memory, a texture cache and a constant cache; global memory (RAM) holds local memory, texture memory and constant memory.]

NVIDIA TESLA C1060 GPU
PCI Express 2.0 compliant computing processor board based on the NVIDIA Tesla T10 graphics processing unit, targeted at HPC applications.
Feature highlights
• 30 SMs = 240 SPs.
• SP clock = 1.296 GHz.
• 4 GB GDDR3 memory with 102 GB/s bandwidth.
• IEEE 754 single and double precision floating point compliant.
• 933 GFLOPS single and 78 GFLOPS double precision performance.
• Compute capability: 1.3.
• Supported by MATLAB for GPU computing.
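The peak throughput figures above can be reproduced with simple arithmetic. The sketch below is only a sanity check: the flops-per-clock values (3 for single precision, 2 for double precision on one unit per SM) are assumptions about the T10 architecture that are not stated on the slide.

```python
# Peak-throughput arithmetic for the Tesla C1060 figures quoted above.
# Assumptions (not stated on the slide): each SP can issue one multiply-add
# plus one multiply per clock (3 flops), and each SM has a single
# double-precision unit doing one fused multiply-add per clock (2 flops).

SM_COUNT = 30
SP_PER_SM = 8
SP_CLOCK_GHZ = 1.296


def peak_gflops(units, clock_ghz, flops_per_clock):
    """Theoretical peak = execution units x clock (GHz) x flops per clock."""
    return units * clock_ghz * flops_per_clock


single = peak_gflops(SM_COUNT * SP_PER_SM, SP_CLOCK_GHZ, 3)
double = peak_gflops(SM_COUNT, SP_CLOCK_GHZ, 2)

print(f"single precision: {single:.2f} GFLOPS")
print(f"double precision: {double:.2f} GFLOPS")
```

Under these assumptions the results round to the 933 and 78 GFLOPS quoted in the feature list.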

CUDA Programming Model
• At its core are thread groups, shared memory and barrier synchronization.
• Provides coarse-grained data and task parallelism and fine-grained data and thread parallelism, giving expressivity and scalability.
• Thread hierarchy: grid, blocks, threads.
• Kernels: functions executed on the device (GPU) in parallel threads.
• CUDA provides APIs to launch kernels in parallel threads and to synchronize them.

Processing Flow
1. Copy input data from CPU to GPU memory.
2. Load the GPU program and execute it, caching results on the device.
3. Copy results from GPU to CPU.

[Figure: processing flow between host and device (labels: Global memory, Constant, Texture, Device)]

Writing Efficient Code
High priority considerations
• Minimize CPU-GPU transfers.
• Use coalesced data transfers.
• Use shared memory instead of global memory whenever possible.
• Avoid different execution paths within a warp.

Medium priority considerations
• Plan accesses to shared memory to avoid serialization.
• Avoid redundant data transfers from global memory.
• Threads per block should be a multiple of 32.
• Use the fast math library whenever possible.

Low priority considerations
• Use zero copy operations.
• For kernels with long argument lists, place some arguments in constant memory.
• Avoid expensive modulo and division operations in favor of shift operations whenever possible.
• Avoid automatic conversion of double to float.
• Use loop unrolling whenever possible.

3. SAR Processing
• What is Synthetic Aperture Radar
• SAR Processing
• Processing Algorithms
• Basic RDA
• Simplified RDA

What is Synthetic Aperture Radar
• An active microwave remote sensing imaging system.
• Employs the long range propagation characteristics of radar and complex signal processing techniques to produce high resolution images.
• High resolution is achieved by synthesizing a long antenna aperture through signal processing techniques.

Pros (in comparison with optical systems):
• All weather, day and night operation.
• Unaffected by atmospheric constituents.
• Sensitivity to dielectric properties (can image ice, biomass etc.).
• Sensitivity to surface roughness (oceans, wind speed etc.).
• Accurate measurement of distance.
• Sensitivity to man-made objects.
• Sensitivity to target structure.
• Subsurface penetration.

Cons:
• Complex interactions (difficult to visualize and understand).
• Speckle effects (make visual interpretation difficult).
• Topographic effects.

SAR Processing
• A set of procedures to obtain an interpretable image from raw data that is spread in the azimuth and range directions.
• In range, the data is spread over the duration of the transmitted FM pulse.
• In azimuth, the data is spread over the time for which a point target is illuminated by the radar beam.
• SAR processing compresses this data, taking into account range cell migration, earth curvature, earth rotation and air/spacecraft attitude noise, to produce the final image.
• Given the nature of SAR systems and signals, signal processing rather than image processing provides the appropriate tools for SAR processing.

SAR Processing Algorithms
Mainstream SAR processing algorithms include:
• Range Doppler algorithm (RDA): high resolution images for low squint and relatively small aperture sizes. Very popular.
• Chirp scaling algorithm (CSA): two-dimensional operations with range independence, followed by range corrections in the range Doppler domain.
• Omega-K algorithm (ωKA): efficient and accurate; operates in the two-dimensional frequency domain.
• SPECAN algorithm: good for medium to low resolution requirements.

Range Doppler Algorithm
Versions of range Doppler:
• Basic RDA
• RDA with accurate SRC
• RDA with approximate SRC
• Simplified range Doppler

Basic RDA
Raw data
1. Range compression: range FFT, matched filter multiply, range IFFT.
2. Azimuth FFT: data is now in the range Doppler domain.
3. RCMC: interpolation operation in the range Doppler domain.
4. Azimuth compression: azimuth matched filter multiply.
5. Azimuth IFFT and look summation: brings the signal back into the time domain.
Final image
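The range compression step (range FFT, matched filter multiply, range IFFT) can be illustrated on a single range line. This is a minimal sketch under stated assumptions, not the thesis code: it uses a naive DFT as a stand-in for a real FFT, and the chirp length, FM rate, delay and the names `make_chirp`, `simulate_echo` and `range_compress` are all hypothetical toy choices.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for i, xi in enumerate(x):
            acc += xi * cmath.exp(sign * 2j * math.pi * k * i / n)
        out.append(acc / n if inverse else acc)
    return out


def make_chirp(n_samples, rate):
    """Unit-amplitude linear FM pulse; `rate` is in cycles per sample^2."""
    mid = n_samples / 2.0
    return [cmath.exp(1j * math.pi * rate * (i - mid) ** 2)
            for i in range(n_samples)]


def simulate_echo(replica, delay, n):
    """Place one delayed copy of the pulse in an otherwise empty range line."""
    line = [0j] * n
    for i, r in enumerate(replica):
        line[delay + i] = r
    return line


def range_compress(line, replica):
    """Range FFT, matched filter multiply, range IFFT for one range line."""
    n = len(line)
    padded = list(replica) + [0j] * (n - len(replica))
    spectrum = dft(line)
    matched = [h.conjugate() for h in dft(padded)]  # matched filter
    return dft([s * h for s, h in zip(spectrum, matched)], inverse=True)


replica = make_chirp(32, rate=0.5 / 32)          # toy pulse, 32 samples
echo = simulate_echo(replica, delay=40, n=128)   # toy target at sample 40
compressed = range_compress(echo, replica)
peak = max(range(len(compressed)), key=lambda m: abs(compressed[m]))
print("peak at sample", peak, "magnitude", round(abs(compressed[peak]), 3))
```

The spread-out pulse collapses to a single peak at the target delay, which is the compression the slide refers to. Azimuth compression follows the same multiply-in-frequency pattern along the other data dimension.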

Simplified RDA
For narrower swath widths and medium resolution requirements, RCM can be assumed to be independent of range.

Raw data
1. Pre-filtering: removes the Doppler centroid.
2. Range compression: range FFT and matched filter multiply (no range IFFT).
3. Azimuth FFT: both range and azimuth are now in the frequency domain.
4. RCMC: RCM phase function multiplied with each range line.
5. Range IFFT: data is now in the range Doppler domain.
6. Azimuth compression.
7. Azimuth IFFT and look summation.
Final image
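The RCMC step in the simplified flow replaces interpolation with a phase multiply, which relies on the Fourier shift theorem: a linear phase ramp across range frequency shifts the range line after the range IFFT. The sketch below illustrates only that idea; the migration values, line length and the name `rcmc_phase_multiply` are hypothetical, and a naive DFT stands in for the FFT.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for i, xi in enumerate(x):
            acc += xi * cmath.exp(sign * 2j * math.pi * k * i / n)
        out.append(acc / n if inverse else acc)
    return out


def rcmc_phase_multiply(range_line, migration_samples):
    """Undo a range shift of `migration_samples` with a linear phase ramp."""
    n = len(range_line)
    spectrum = dft(range_line)                     # range FFT
    corrected = []
    for k, value in enumerate(spectrum):
        f = k if k < n // 2 else k - n             # signed frequency bin
        ramp = cmath.exp(2j * math.pi * f * migration_samples / n)
        corrected.append(value * ramp)             # RCM phase function multiply
    return dft(corrected, inverse=True)            # range IFFT


def peak_index(line):
    return max(range(len(line)), key=lambda m: abs(line[m]))


# Two toy range lines whose target has migrated by 3 and 2 samples
# away from its true position at sample 10 (values are illustrative).
n = 16
migrations = [3, 2]
lines = []
for shift in migrations:
    line = [0j] * n
    line[10 + shift] = 1 + 0j
    lines.append(line)

aligned = [rcmc_phase_multiply(line, shift)
           for line, shift in zip(lines, migrations)]
print([peak_index(line) for line in aligned])
```

Because the migration is assumed independent of range, one phase ramp per line corrects every range cell on that line at once, which maps naturally onto one GPU thread per sample.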

4. Implementation
• Hardware resources
• Software resources
• CPU Implementation
• MATLAB GPU Implementation
• CUDA Implementation
• Result Comparison

Hardware resources

GPU
Name                        NVIDIA Tesla C1060
# of cores                  240
SP clock                    1.296 GHz
Memory                      4 GB GDDR3
Maximum memory bandwidth    102 GB/s
Memory interface            512 bit, PCI Express
GFLOPS                      933 single precision, 78 double precision

CPU
Name                        Intel Xeon E5504
CPU clock                   2 GHz
# of cores                  4
System memory DDR3 clock    800 MHz
Maximum memory bandwidth    19.2 GB/s
Memory type                 DDR3 PC3
PCI slot                    PCI Express

Software resources (CPU and GPU)
• Windows 7 Ultimate 64-bit
• MATLAB release 2010b
• Visual Studio 2008
• CUDA Toolkit 4.1
• NVIDIA Parallel Nsight
• Visual Profiler
• CUDA MEMCHECK
• CUFFT library

RADARSAT-1 Data
• CEOS format.
• Raw data must be extracted from the CEOS data before the SAR processing algorithm can be applied.

Parameter                    Value     Units
Sampling rate                32.317    MHz
Range FM rate                0.72135   MHz/µs
Pulse duration               41.74     µs
Radar frequency              5.3       GHz
Radar wavelength             0.05657   m
Pulse repetition frequency   1256.98   Hz
Effective radar velocity     7062      m/s
Azimuth FM rate              1733      Hz/s
Doppler centroid             -6900     Hz

Table: RADARSAT-1 data parameters
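A few derived quantities follow directly from the table and are useful when sizing the range matched filter. The sketch below is plain arithmetic on the tabulated values; the variable names are illustrative, and the speed of light is the only value not taken from the table.

```python
# Derived quantities from the RADARSAT-1 parameter table (values as listed).
SPEED_OF_LIGHT = 299_792_458.0          # m/s, not from the table

sampling_rate_hz = 32.317e6
range_fm_rate_hz_per_s = 0.72135e12     # 0.72135 MHz/us expressed in Hz/s
pulse_duration_s = 41.74e-6
radar_frequency_hz = 5.3e9

# Bandwidth swept by the transmitted FM pulse.
chirp_bandwidth_hz = range_fm_rate_hz_per_s * pulse_duration_s
# Length of the range replica (matched filter) in samples.
samples_per_pulse = pulse_duration_s * sampling_rate_hz
# Ratio of sampling rate to chirp bandwidth.
oversampling = sampling_rate_hz / chirp_bandwidth_hz
# Wavelength implied by the listed radar frequency.
wavelength_m = SPEED_OF_LIGHT / radar_frequency_hz

print(f"chirp bandwidth  : {chirp_bandwidth_hz / 1e6:.3f} MHz")
print(f"samples per pulse: {samples_per_pulse:.1f}")
print(f"oversampling     : {oversampling:.3f}")
print(f"wavelength       : {wavelength_m:.5f} m")
```

The computed wavelength is close to the tabulated 0.05657 m; the small difference comes from the radar frequency being listed to only two significant figures.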

[Figure: data extraction flow. CEOS data → CEOS data extraction utility → raw SAR data]

SAR Processing GUI
Functions
• CEOS data extraction.
• MATLAB-CPU SAR processing.
• MATLAB-GPU SAR processing.
• CUDA input/output manipulation.
• CUDA program execution.

CPU Implementation
• Implemented using MATLAB.
• FFT/IFFT using standard MATLAB functions.

[Figure: CPU processed SAR image. A 2048 × 4096 SAR image produced by the CPU-based implementation.]

MATLAB-GPU Implementation
• MATLAB has supported GPU computing since release 2010b.
• Implemented using native MATLAB-GPU functions only (no CUDA kernel calls).
• Vectorization strategy employed to implement vector-matrix multiplications on the GPU.
• All FFTs/IFFTs performed using the MATLAB-GPU FFT/IFFT support functions.

[Figure: vectorization strategy, showing three matrices each laid out as Column 1, Column 2, ..., Column n]
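One way to read the vectorization strategy is that the filter vector is replicated into a matrix with one copy per data column, so the whole multiplication becomes a single element-wise matrix operation. The sketch below assumes that reading; the helper names are hypothetical, and the 8 bytes per sample (single-precision complex) in the memory estimate is an assumption, since the slides do not state the data type.

```python
def replicate_columns(filter_vector, n_columns):
    """Build a matrix in which every column equals `filter_vector`."""
    return [[value] * n_columns for value in filter_vector]


def elementwise_multiply(a, b):
    """Element-wise product of two equally sized matrices (lists of rows)."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def multiply_column_by_column(data, filter_vector):
    """Loop-based reference: scale each column by the filter, one at a time."""
    n_rows, n_cols = len(data), len(data[0])
    out = [[0j] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for row in range(n_rows):
            out[row][col] = data[row][col] * filter_vector[row]
    return out


def replicated_filter_bytes(n_rows, n_cols, bytes_per_sample):
    """Extra memory needed to hold the replicated filter matrix."""
    return n_rows * n_cols * bytes_per_sample


data = [[1 + 1j, 2 + 0j, 3 - 1j],
        [0 + 2j, 1 + 0j, 4 + 4j]]
filter_vector = [2 + 0j, 0 + 1j]

vectorized = elementwise_multiply(data, replicate_columns(filter_vector, 3))
looped = multiply_column_by_column(data, filter_vector)
print(vectorized == looped)

# Assumed sample size: 8 bytes (single-precision complex).
extra_mib = replicated_filter_bytes(2048, 4096, 8) / 1024 ** 2
print(f"replicated filter for a 2048 x 4096 image: {extra_mib:.0f} MiB")
```

The replicated matrix is as large as the image itself, which is one reason a vectorized implementation needs more GPU memory than a kernel that reads the filter vector directly.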

MATLAB-GPU Implementation
• GPU memory constraints limit the maximum image size that can be processed.
• Speedups of up to 21 times achieved compared with the CPU implementation.

[Figure: A 2048 × 4096 SAR image produced by the MATLAB-GPU based implementation]

MATLAB-GPU Implementation
Advantages
• Quick and easy to implement.
• Sufficient speedups obtained with little effort.
• Little knowledge of GPU hardware and no knowledge of optimization techniques required.

Disadvantages
• Currently, only a limited number of MATLAB functions are supported on the GPU.
• Not all overloads of a function are available for the GPU.
• Less control over hardware resources and memory.
• Few optimization options.

CUDA Implementation
Strategy
1. Signal data read as a binary file.
2. Vectors and matched filters calculated on the CPU.
3. Vectors and signal data transferred to the GPU.
4. Following kernels executed in order on the GPU:
   • Pre-filtering kernel
   • Range compression kernel
   • RCMC kernel
   • Azimuth compression kernel
   • Image pixel calculation kernel
5. Data transferred from GPU to CPU and saved on disk as an image.

Optimization considerations
• Chosen block size = 8 × 8 = 64 threads. Conforms with memory coalescing requirements.
• Constant variables stored in constant memory.
• Local variables and phase functions calculated within the kernels whenever possible to reduce global memory accesses.
• CPU-GPU data transfers kept to a minimum: data is transferred from CPU to GPU at the beginning and from GPU to CPU at the end of the algorithm.
• CUFFT's cufftPlanMany() plans used for FFTs/IFFTs along data columns.
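The 8 × 8 block size determines how many blocks the grid needs and how each thread finds its sample. The sketch below emulates that index arithmetic on the host in Python rather than in a kernel. The mapping of x to columns and y to rows, the row-major layout and the names `grid_dims` and `global_index` are assumptions made for illustration, not details taken from the thesis code.

```python
BLOCK_DIM = (8, 8)   # threads per block in x and y, as chosen on the slide


def grid_dims(width, height, block=BLOCK_DIM):
    """Number of blocks in x and y needed to cover a width x height image."""
    blocks_x = -(-width // block[0])    # ceiling division
    blocks_y = -(-height // block[1])
    return blocks_x, blocks_y


def global_index(block_idx, thread_idx, width, block=BLOCK_DIM):
    """Mirror of the usual blockIdx * blockDim + threadIdx computation.

    Returns (row, col, linear offset) assuming row-major storage.
    """
    col = block_idx[0] * block[0] + thread_idx[0]
    row = block_idx[1] * block[1] + thread_idx[1]
    return row, col, row * width + col


# Assumed orientation: 4096 samples along x, 2048 lines along y.
blocks = grid_dims(4096, 2048)
threads_total = blocks[0] * blocks[1] * BLOCK_DIM[0] * BLOCK_DIM[1]
print("grid:", blocks, "threads:", threads_total)
print("block (1, 2), thread (3, 4) ->", global_index((1, 2), (3, 4), 4096))
```

With 64 threads per block, each block is exactly two warps of 32 threads, which is consistent with the multiple-of-32 guideline listed under Writing Efficient Code.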

CUDA Implementation Results

[Figure: A 2048 × 4096 SAR image produced by the CUDA based implementation]

[Figure: CUDA/MATLAB-GPU/MATLAB-CPU computation time comparison]

[Figure: MATLAB-GPU/CUDA computation time comparison]

[Figure: MATLAB-GPU/CUDA speedup comparison]

Speedups of up to 53 times achieved with CUDA, compared with a maximum speedup of 21 times with MATLAB-GPU.

5. Conclusions & Future Work

Conclusions
Feasibility of GPU for SAR processing
• The amount of data, the computational effort and the inherent algorithm parallelism make SAR processing well suited to the GPU.
• The TESLA C1060 GPU offers enough memory to handle various common SAR image sizes.
• Cooling the GPU may be a challenge in some environments.
• The scalability of CUDA will be an advantage when porting existing SAR code to newer GPUs.
• GPUs might not be suitable where customizable hardware is required or military hardware standards must be adhered to.

Conclusions
MATLAB-GPU based SAR Processing
• Significant speedups compared with CPU.
• Quick and easy to implement.
• Has some limitations:
  • Currently has limited function support for the GPU. Expected to improve with future MATLAB releases.
  • The vectorization strategy needs more memory. Future releases promise to remove the need for vectorization (e.g. bsxfun in release 2012a).
  • Less control over GPU resources (memory etc.).

CUDA SAR Processing
• CUDA: flexible and scalable, with a minimal learning curve.
• More control over GPU resources.
• Optimization strategies can be applied.
• Faster and more memory efficient than the MATLAB implementation.

Conclusions
Downsides of GPU
• Significant testing/verification effort might be required if the GPU hardware has to be upgraded (because the old hardware becomes obsolete).
• The proprietary nature of CUDA might be problematic if the company discontinues CUDA or its support.

Future work
• CUDA kernels can be called from MATLAB code using MATLAB's CUDA kernel calling support.
• The MATLAB GPU implementation can be improved as newer and better functions become available.
• A C/C++ based CPU implementation can be developed to better judge MATLAB-CPU/CUDA performance.
• Other SAR processing algorithms can be implemented using the framework laid out in this project.

Thank You
