Najeeb Ahmad
Master Thesis Presentation
May, 2012
Supervisor: Dr. Sun Jinping
Design and Implementation of GPU-based SAR Image Processor
School of Electronic Information Engineering, Beihang University, Beijing, China.
Contents
1. Introduction
2. GPU Computing
3. SAR Processing
4. Implementation
5. Conclusion & Future Work
1. Introduction
• Problem
• Motivation
• Objective
• Methodology
PROBLEM
Synthetic aperture radar (SAR) data processing is computationally intensive and time consuming on conventional CPUs. Given the growing use of GPUs for scientific computing, the task is to accelerate the simplified range Doppler SAR processing algorithm on a GPU using modern GPGPU technology, to achieve real-time or near-real-time performance, and to evaluate the GPU's suitability for SAR processing.
MOTIVATION
• SAR processing algorithms are computationally intensive and time consuming.
• Most SAR processing algorithms are inherently parallel.
• Modern GPGPU technology has made commodity GPUs available as general-purpose computation engines.
• The architectural parallelism and ample hardware resources of modern GPUs make them especially useful for handling large data volumes and for implementing parallel SAR algorithms.
OBJECTIVE
• To implement and accelerate the simplified range Doppler SAR processing algorithm on a modern NVIDIA Tesla GPU using CUDA and MATLAB-GPU capabilities.
• The research explores areas such as:
  • Algorithm adaptation for parallel implementation.
  • Suitability of MATLAB for algorithm implementation.
  • Suitability of CUDA for algorithm implementation.
  • Comparison of CPU, CUDA, and MATLAB-GPU implementations.
  • The GPU as a SAR processing platform.
METHODOLOGY
1. Algorithm implementation and verification on an Intel Xeon CPU using MATLAB.
2. Identification of the parallelizable portions of the algorithm.
3. Algorithm implementation on the Tesla C1060 GPU using MATLAB's native GPU capabilities.
4. Algorithm implementation on the Tesla C1060 GPU using CUDA.
5. Analysis of the CPU, MATLAB-GPU, and CUDA implementations.
2. GPU Computing
• Introduction to GPU Computing
• GPGPU: Brief History
• NVIDIA CUDA
• Writing efficient code
Introduction to GPU Computing
• Use of graphics processing units (GPUs) for general-purpose computing applications.
• CPU: one, four, or eight cores. Capable of handling a few threads. Suitable for serial code.
• GPU: hundreds of cores. Capable of handling hundreds of threads. Suitable for parallel code.
• GPU computing model: a heterogeneous computing model that employs both CPU and GPU, with serial computing on the CPU and parallel computing on the GPU.
GPGPU: Brief History
• GPUs were first used as general-purpose computing devices around 1999-2000, through graphics APIs. Huge performance boosts were observed, but the approach remained unpopular because programming was tedious.
• NVIDIA's “CUDA” and AMD's “Stream Computing” were introduced in 2007, marking the beginning of the modern GPGPU era. Other vendors introduced their own GPGPU systems.
• NVIDIA's CUDA is gaining popularity due to its maturity and performance.
NVIDIA CUDA
• Compute Unified Device Architecture.
• Comprises an instruction set architecture (ISA) and a parallel compute engine in the GPU, programmable with high-level languages extended for GPU computing.
• The CUDA framework has two parts: hardware and software. From the software perspective, CUDA means C/C++ and FORTRAN extended to support GPU computing.
• CUDA is a “Single Instruction Multiple Thread” (SIMT) architecture.
CUDA Hardware
• Streaming multiprocessor (SM): the basic computing unit of the GPU. It comprises eight streaming processors (SPs) and memory. GPUs differ in the number of SMs and in SP clock frequency.
[Figure: streaming multiprocessor (SM) with eight SPs, two SFUs, an MT IU, and shared memory]
CUDA Memory Architecture
• Understanding the memory architecture is critical for writing efficient CUDA programs.
• All CUDA-enabled hardware has the following types of memory:
  • Global memory.
  • Shared memory and registers.
  • Texture memory and texture cache.
  • Constant memory and constant cache.
  • Local memory for register spilling.
[Figure: GPU with SM 1 to SM n, each containing SPs, shared memory, a texture cache, and a constant cache, above global memory (RAM), which holds local, texture, and constant memory]
NVIDIA Tesla C1060 GPU
• PCI Express 2.0 compliant computing processor board based on the NVIDIA Tesla T10 graphics processing unit, targeted at HPC applications.
• Feature highlights:
  • 30 SMs = 240 SPs.
  • SP clock = 1.296 GHz.
  • 4 GB GDDR3 memory with 102 GB/s bandwidth.
  • IEEE 754 compliant single- and double-precision floating point.
  • 933 GFLOPS single-precision and 78 GFLOPS double-precision performance.
  • Compute capability: 1.3.
  • Supported by MATLAB for GPU computing.
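The peak figures quoted for the Tesla C1060 can be reproduced from the core counts and the clock. The sketch below is a rough check rather than a vendor formula. It assumes three single-precision flops per SP per clock (a multiply-add plus a multiply), one double-precision fused multiply-add unit per SM, and, for the bandwidth, a 512-bit memory interface clocked at 800 MHz with double data rate. None of these per-clock assumptions appear on the slide.

```python
# Rough peak-throughput arithmetic for the Tesla C1060.
# Flops-per-clock values and the memory clock are assumptions, not slide content.
SM_COUNT = 30
SP_PER_SM = 8
SP_CLOCK_GHZ = 1.296

SP_FLOPS_PER_CLOCK = 3   # assumed: multiply-add (2 flops) + multiply (1 flop) per SP
DP_UNITS_PER_SM = 1      # assumed: one double-precision unit per SM
DP_FLOPS_PER_CLOCK = 2   # assumed: a fused multiply-add counts as 2 flops

sp_count = SM_COUNT * SP_PER_SM
single_gflops = sp_count * SP_FLOPS_PER_CLOCK * SP_CLOCK_GHZ
double_gflops = SM_COUNT * DP_UNITS_PER_SM * DP_FLOPS_PER_CLOCK * SP_CLOCK_GHZ

BUS_WIDTH_BITS = 512     # memory interface width
MEM_CLOCK_GHZ = 0.8      # assumed memory clock
DATA_RATE = 2            # double data rate: two transfers per clock
bandwidth_gb_s = BUS_WIDTH_BITS / 8 * MEM_CLOCK_GHZ * DATA_RATE

print(sp_count, round(single_gflops, 2), round(double_gflops, 2), round(bandwidth_gb_s, 1))
```

Under these assumptions the results round to 240 SPs, 933 and 78 GFLOPS, and roughly 102 GB/s.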
CUDA Programming Model
• At its core are thread groups, shared memory, and barrier synchronization.
• Provides coarse-grained data and task parallelism and fine-grained data and thread parallelism, giving expressivity and scalability.
• Thread hierarchy: grid, blocks, threads.
• Kernels: functions executed on the device (GPU) in parallel threads.
• CUDA provides APIs to launch kernels in parallel threads and to synchronize them.
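The grid/block/thread hierarchy is easiest to see through the index arithmetic each thread performs. The following is a plain Python model of that arithmetic, not GPU code. The names `global_index`, `launch`, and `scale_kernel` are hypothetical and exist only for illustration, and the loops run sequentially where a GPU would run the threads in parallel.

```python
# Pure-Python model of the CUDA thread hierarchy (grid -> blocks -> threads).
# It mimics the usual index arithmetic; nothing here runs on a GPU.

def global_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread, as in blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

def launch(grid_dim, block_dim, kernel, data):
    """Call `kernel` once per (block, thread) pair, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(global_index(block_idx, block_dim, thread_idx), data)

def scale_kernel(i, data):
    # Guard: the last block may contain threads past the end of the data.
    if i < len(data):
        data[i] *= 2.0

samples = [float(v) for v in range(10)]
block_dim = 4
grid_dim = (len(samples) + block_dim - 1) // block_dim  # ceiling division
launch(grid_dim, block_dim, scale_kernel, samples)
print(grid_dim, samples)
```

Ten elements with four threads per block need three blocks, so two threads in the last block do no work. The bounds check in the kernel covers that case.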
Processing Flow
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it, caching results on the device.
3. Copy results from GPU memory to CPU memory.
[Figure: host (CPU and RAM) connected to device (GPU with global, constant, and texture memory)]
Writing Efficient Code
• High priority considerations
  • Minimize CPU-GPU transfers.
  • Use coalesced data transfers.
  • Use shared memory instead of global memory whenever possible.
  • Avoid different execution paths within a warp.
• Medium priority considerations
  • Plan shared memory accesses to avoid serialization.
  • Avoid redundant data transfers from global memory.
  • Make the number of threads per block a multiple of 32.
  • Use the fast math library whenever possible.
• Low priority considerations
  • Use zero-copy operations.
  • For kernels with long argument lists, place some arguments in constant memory.
  • Avoid expensive modulo and division operations in favor of shift operations whenever possible.
  • Avoid automatic conversion of double to float.
  • Use loop unrolling whenever possible.
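The advice on modulo and division relies on a simple identity: for non-negative integers and a power-of-two divisor, division is a right shift and modulo is a bit mask. The sketch below checks that identity in Python for a divisor of 32. The helper name `warp_and_lane` is hypothetical.

```python
# For power-of-two divisors and non-negative integers, division and modulo
# reduce to a shift and a bit mask.
WARP_SIZE = 32                       # a power of two
SHIFT = WARP_SIZE.bit_length() - 1   # log2(32) = 5
MASK = WARP_SIZE - 1                 # 0b11111

def warp_and_lane(thread_id):
    """Return (thread_id // 32, thread_id % 32) without using / or %."""
    return thread_id >> SHIFT, thread_id & MASK

# The shift/mask form agrees with ordinary division and modulo.
for tid in (0, 31, 32, 100, 255):
    assert warp_and_lane(tid) == (tid // WARP_SIZE, tid % WARP_SIZE)

print(warp_and_lane(100))  # (3, 4)
```

The identity holds only for power-of-two divisors. Other divisors still need a real division.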
3. SAR Processing
• What is Synthetic Aperture Radar
• SAR Processing
• Processing Algorithms
• Basic RDA
• Simplified RDA
What is Synthetic Aperture Radar
• An active microwave remote sensing imaging system.
• Employs the long-range propagation characteristics of radar and complex signal processing techniques to produce high-resolution images.
• High resolution is achieved by synthesizing a long antenna aperture through signal processing.
• Pros (in comparison with optical systems):
  • All-weather, day-and-night operation.
  • No effects from atmospheric constituents.
  • Sensitivity to dielectric properties (can image ice, biomass, etc.).
  • Sensitivity to surface roughness (oceans, wind speed, etc.).
  • Accurate measurement of distance.
  • Sensitivity to man-made objects.
  • Sensitivity to target structure.
  • Subsurface penetration.
• Cons:
  • Complex interactions (difficult to visualize and understand).
  • Speckle effects (make visual interpretation difficult).
  • Topographic effects.
SAR Processing
• A set of procedures for obtaining an interpretable image from raw data that is spread in the azimuth and range directions.
• In range, the data is spread over the duration of the transmitted FM pulse.
• In azimuth, the data is spread over the time a point target is illuminated by the radar beam.
• SAR processing compresses this data, taking into account range cell migration, earth curvature, earth rotation, and aircraft/spacecraft attitude noise, to produce the final image.
• Given the nature of the SAR system and its signals, signal processing rather than image processing provides the appropriate tools.
SAR Processing Algorithms
Mainstream SAR processing algorithms include:
• Range Doppler algorithm (RDA): high-resolution images for low squint and relatively small aperture sizes. Very popular.
• Chirp scaling algorithm (CSA): two-dimensional operations with range independence, followed by range corrections in the range Doppler domain.
• Omega-K algorithm (ωKA): efficient and accurate, operating in the two-dimensional frequency domain.
• SPECAN algorithm: good for medium- to low-resolution requirements.
Range Doppler Algorithm
Versions of range Doppler:
• Basic RDA
• RDA with accurate SRC
• RDA with approximate SRC
• Simplified range Doppler
Basic RDA
1. Raw data.
2. Range compression: range FFT, matched filter multiply, range IFFT.
3. Azimuth FFT: the data is now in the range Doppler domain.
4. RCMC: interpolation operation in the range Doppler domain.
5. Azimuth compression: azimuth matched filter multiply.
6. Azimuth IFFT and look summation: brings the signal back into the time domain.
7. Final image.
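The range compression step (range FFT, matched filter multiply, range IFFT) can be sketched on a single range line containing one point target. This is a minimal illustration and not the thesis code. The chirp parameters and target delay are toy values chosen for speed, the matched filter is taken as the conjugate spectrum of the zero-padded pulse replica, and the small recursive `fft` helper stands in for a library FFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]

# Toy parameters (assumptions for illustration, not the RADARSAT values).
N = 256                        # samples in one range line
PULSE_LEN = 64                 # samples in the transmitted chirp
CHIRP_RATE = 0.5 / PULSE_LEN   # cycles per sample^2; bandwidth = half the sampling rate
TARGET_DELAY = 100             # sample index at which the echo starts

# Linear FM pulse replica.
chirp = [cmath.exp(1j * cmath.pi * CHIRP_RATE * (n - PULSE_LEN / 2) ** 2)
         for n in range(PULSE_LEN)]

# Echo of a single point target: the chirp delayed by TARGET_DELAY samples.
range_line = [0j] * N
for n, value in enumerate(chirp):
    range_line[TARGET_DELAY + n] = value

# Frequency-domain matched filter: conjugate spectrum of the zero-padded replica.
replica = chirp + [0j] * (N - PULSE_LEN)
matched_filter = [v.conjugate() for v in fft(replica)]

# Range FFT, matched filter multiply, range IFFT.
spectrum = fft(range_line)
compressed = ifft([s * h for s, h in zip(spectrum, matched_filter)])

magnitudes = [abs(v) for v in compressed]
peak_index = magnitudes.index(max(magnitudes))
print(peak_index)
```

The spread-out echo collapses to a peak at the target delay. In a full processor the same multiply is applied to every range line, which is why the step maps naturally onto parallel threads.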
Simplified RDA
For narrower swath widths and medium resolution requirements, RCM can be assumed independent of range.
1. Raw data.
2. Pre-filtering: removes the Doppler centroid.
3. Range compression: range FFT and matched filter multiply (no range IFFT).
4. Azimuth FFT: both range and azimuth are now in the frequency domain.
5. RCMC: RCM phase function multiplied with each range line.
6. Range IFFT: the data is now in the range Doppler domain.
7. Azimuth compression.
8. Azimuth IFFT and look summation.
9. Final image.
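Replacing the RCMC interpolation with a phase multiply rests on the Fourier shift property: multiplying a spectrum by a linear phase ramp shifts the signal in the other domain. The slides do not give the RCM phase function itself, so the sketch below only demonstrates the shift property on one short range line. The shift amount and its sign are hypothetical, and the direct `dft` helper is used purely to keep the example self-contained.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT, adequate for a short illustrative signal."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

N = 16
SHIFT = 5   # hypothetical migration, in range samples

# A range line with a single response at sample 3.
line = [0j] * N
line[3] = 1 + 0j

# Linear phase ramp over range frequency: delays the signal by SHIFT samples.
phase = [cmath.exp(-2j * cmath.pi * k * SHIFT / N) for k in range(N)]

spectrum = dft(line)
shifted = dft([s * p for s, p in zip(spectrum, phase)], inverse=True)

magnitudes = [abs(v) for v in shifted]
new_position = magnitudes.index(max(magnitudes))
print(new_position)
```

The response moves from sample 3 to sample 8 without any interpolation. Because the multiply is element-wise and independent per range line, it suits a one-thread-per-sample GPU kernel.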
4. Implementation
• Hardware resources
• Software resources
• CPU Implementation
• MATLAB GPU Implementation
• CUDA Implementation
• Result Comparison
Hardware resources

CPU
Parameter                   Value
Name                        Intel Xeon E5504
CPU clock                   2 GHz
# of cores                  4
System memory               4 GB
DDR3 clock                  800 MHz
Maximum memory bandwidth    19.2 GB/s
Memory type                 DDR3 PC3
PCI slot                    PCI Express

GPU
Parameter                   Value
Name                        NVIDIA Tesla C1060
# of cores                  240
SP clock                    1.296 GHz
Memory                      4 GB GDDR3
Maximum memory bandwidth    102 GB/s
Memory interface            512 bit, PCI Express
GFLOPS                      933 single precision, 78 double precision
Software resources
CPU:
• Windows 7 Ultimate 64-bit
• MATLAB release 2010b
• Visual Studio 2008 SP1
GPU:
• CUDA Toolkit 4.1
• MATLAB release 2010b
• NVIDIA Parallel Nsight
• Visual Profiler
• CUDA MEMCHECK
• CUFFT library
RADARSAT-I Data
• CEOS format.
• Raw data must be extracted from the CEOS data before the SAR processing algorithm can be applied.
Table: RADARSAT-I data parameters
Parameter                     Value      Units
Sampling rate                 32.317     MHz
Range FM rate                 0.72135    MHz/µs
Pulse duration                41.74      µs
Radar frequency               5.3        GHz
Radar wavelength              0.05657    m
Pulse repetition frequency    1256.98    Hz
Effective radar velocity      7062       m/s
Azimuth FM rate               1733       Hz/s
Doppler centroid              -6900      Hz
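A few derived quantities follow directly from the parameter table and help when sizing the matched filter. The sketch below reads the range FM rate as 0.72135 MHz/µs and computes the chirp bandwidth, the number of samples in one transmitted pulse, and the range oversampling ratio. The variable names are illustrative only.

```python
# Quantities derived from the RADARSAT parameter table.
SAMPLING_RATE_MHZ = 32.317
RANGE_FM_RATE_MHZ_PER_US = 0.72135
PULSE_DURATION_US = 41.74

# Bandwidth swept by the linear FM pulse: FM rate x pulse duration.
chirp_bandwidth_mhz = RANGE_FM_RATE_MHZ_PER_US * PULSE_DURATION_US

# Samples spanned by one pulse: sampling rate x pulse duration.
samples_per_pulse = SAMPLING_RATE_MHZ * PULSE_DURATION_US

# Ratio of sampling rate to chirp bandwidth.
oversampling_ratio = SAMPLING_RATE_MHZ / chirp_bandwidth_mhz

print(round(chirp_bandwidth_mhz, 2), round(samples_per_pulse, 1), round(oversampling_ratio, 3))
```

The pulse spans roughly 1349 samples, so a 2048-point range FFT is the smallest power-of-two size that holds a full replica.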
[Figure: CEOS data -> CEOS data extraction utility -> raw SAR data]
SAR Processing GUI
Functions:
• CEOS data extraction.
• MATLAB-CPU SAR processing.
• MATLAB-GPU SAR processing.
• CUDA input/output manipulation.
• CUDA program execution.
CPU Implementation
• Implemented using MATLAB.
• FFT/IFFT performed using standard MATLAB functions.
[Figure: CPU-processed SAR image, a 2048 x 4096 SAR image produced by the CPU-based implementation]
MATLAB-GPU Implementation
• MATLAB has supported GPU computing since release 2010b.
• Implemented using native MATLAB-GPU functions only (no CUDA kernel calls).
• A vectorization strategy is employed to implement vector-matrix multiplications on the GPU.
• All FFT/IFFTs are performed using MATLAB-GPU FFT/IFFT support functions.
[Figure: vectorization strategy, showing three arrays each laid out as Column 1, Column 2, ..., Column n]
• The maximum image size that can be processed is limited by GPU memory.
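The memory limit can be made concrete with a back-of-the-envelope budget. The sketch below estimates the footprint of one image-sized complex array for the 2048 x 4096 image shown in the slides. The element sizes are assumptions, since the slides do not state the precision used, and the 4 GB of the Tesla C1060 is treated as 4 GiB.

```python
# Back-of-the-envelope GPU memory budget for one image-sized complex array.
# Element sizes are assumptions: the slides do not state the precision used.
ROWS, COLS = 2048, 4096
BYTES_COMPLEX_SINGLE = 8          # two 4-byte floats
BYTES_COMPLEX_DOUBLE = 16         # two 8-byte floats
GPU_MEMORY_BYTES = 4 * 1024**3    # 4 GB board memory, taken as 4 GiB here

def array_mib(rows, cols, bytes_per_element):
    """Size of a rows x cols array in MiB."""
    return rows * cols * bytes_per_element / 1024**2

single_mib = array_mib(ROWS, COLS, BYTES_COMPLEX_SINGLE)
double_mib = array_mib(ROWS, COLS, BYTES_COMPLEX_DOUBLE)

# Upper bound on how many double-precision image-sized arrays fit at once.
max_double_arrays = GPU_MEMORY_BYTES // (ROWS * COLS * BYTES_COMPLEX_DOUBLE)

print(single_mib, double_mib, max_double_arrays)
```

Every vector replicated into a full matrix for the vectorization strategy counts as one such array, alongside the data and FFT workspace, so the usable image size shrinks quickly as intermediate matrices accumulate.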
• Speedups as high as 21 times were achieved compared with the CPU implementation.
[Figure: a 2048 x 4096 SAR image produced by the MATLAB-GPU based implementation]
Advantages of the MATLAB-GPU implementation
• Quick and easy to implement.
• Sufficient speedups obtained with little effort.
• Requires little knowledge of GPU hardware and no knowledge of optimization techniques.
Disadvantages of the MATLAB-GPU implementation
• Currently, only a limited number of MATLAB functions are supported on the GPU.
• Not all overloads of a function are available for the GPU.
• Less control over hardware resources and memory.
• Few optimization options.
CUDA Implementation
Strategy
1. Signal data is read as a binary file.
2. Vectors and matched filters are calculated on the CPU.
3. Vectors and signal data are transferred to the GPU.
4. The following kernels are executed in order on the GPU:
   • Pre-filtering kernel
   • Range compression kernel
   • RCMC kernel
   • Azimuth compression kernel
   • Image pixel calculation kernel
5. Data is transferred from the GPU to the CPU and saved on disk as an image.
Optimization considerations
• Chosen block size = 8 × 8 = 64. Conforms with memory coalescing requirements.
• Constant variables are stored in constant memory.
• Local variables and phase functions are computed locally whenever possible to reduce global memory access.
• CPU-GPU data transfer is kept to a minimum: data is transferred from CPU to GPU at the beginning of the algorithm and from GPU to CPU at the end.
• CUFFT's cufftPlanMany() plan is used for FFT/IFFTs along data columns.
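The launch geometry implied by an 8 × 8 block can be worked out for the 2048 x 4096 image shown in the slides. The sketch below is plain Python arithmetic under stated assumptions: one thread per pixel, and the x dimension mapped to columns. The slides do not give the actual kernel launch configuration, and `ceil_div` and `pixel_of` are hypothetical helper names.

```python
# Launch-geometry arithmetic for an 8 x 8 thread block over a 2048 x 4096 image.
# One thread per pixel and the x-to-column mapping are assumptions.
WARP_SIZE = 32
BLOCK_X, BLOCK_Y = 8, 8
ROWS, COLS = 2048, 4096

def ceil_div(a, b):
    """Smallest integer >= a / b for positive integers."""
    return (a + b - 1) // b

threads_per_block = BLOCK_X * BLOCK_Y
warps_per_block = threads_per_block // WARP_SIZE
grid_x = ceil_div(COLS, BLOCK_X)
grid_y = ceil_div(ROWS, BLOCK_Y)

def pixel_of(block_x, block_y, thread_x, thread_y):
    """(row, col) handled by one thread under the assumed mapping."""
    return block_y * BLOCK_Y + thread_y, block_x * BLOCK_X + thread_x

print(threads_per_block, warps_per_block, (grid_x, grid_y))
print(pixel_of(grid_x - 1, grid_y - 1, BLOCK_X - 1, BLOCK_Y - 1))
```

The block holds 64 threads, a multiple of the 32-thread warp size, and the image divides evenly into a 512 x 256 grid, so no thread falls outside the image.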
CUDA Implementation Results
[Figure: a 2048 x 4096 SAR image produced by the CUDA-based implementation]
[Figure: CUDA/MATLAB-GPU/MATLAB-CPU computation time comparison]
[Figure: MATLAB-GPU/CUDA computation time comparison]
MATLAB-GPU/CUDA speedup comparison
• Speedups as high as 53 times were achieved with CUDA, compared with a maximum speedup of 21 times with MATLAB-GPU.
5. Conclusions & Future Work
Conclusions
Feasibility of GPU for SAR processing
• The amount of data, the computational effort, and the inherent algorithm parallelism make SAR processing well suited to the GPU.
• The Tesla C1060 GPU offers enough memory to handle various common SAR image sizes.
• Cooling the GPU may be a challenge in some environments.
• The scalability of CUDA will be an advantage when porting existing SAR code to newer GPUs.
• GPUs might not be suitable where customizable hardware is required or military hardware standards must be adhered to.
MATLAB-GPU based SAR Processing
• Significant speedups compared with the CPU.
• Quick and easy to implement.
• Has some limitations:
  • Currently has limited function support for the GPU. Expected to improve with future MATLAB releases.
  • The vectorization strategy needs more memory. Future releases promise to remove the need for vectorization (e.g. bsxfun in release 2012a).
  • Less control over GPU resources (memory etc.).
CUDA SAR Processing
• CUDA is flexible and scalable, with a minimal learning curve.
• More control over GPU resources.
• Optimization strategies can be applied.
• Faster and more memory efficient than the MATLAB implementation.
Downsides of GPU
• Significant testing and verification effort might be required if the GPU hardware has to be upgraded (because the old hardware has become obsolete).
• The proprietary nature of CUDA might be problematic if the company discontinues CUDA or its support.
Future work
• CUDA kernels can be called from MATLAB code using MATLAB's CUDA kernel calling support.
• The MATLAB-GPU implementation can be improved as newer and better functions become available.
• A C/C++ based CPU implementation can be developed to better judge MATLAB-CPU/CUDA performance.
• Other SAR processing algorithms can be implemented using the framework laid out in this project.
Q & A
Thank You