HardwareAccelerationofPulsarSearchonFPGAs usingOpenCL · KKerenrenel lP iPpipeleinlinee K er nlsP ip 2GB DDR3 x 2 Memory (Global Memory) CIe Core 1 Cor4 Host Memory (DDR3 and SSD)

Overview and TaskFT Convolution Decomposition

High-level Techniques and ImplementationEvaluation

What’s Next

Hardware Acceleration of Pulsar Search on FPGAsusing OpenCL

Oliver Sinnen Haomiao Wang&

Prabu Thiagaraj (Manchester Uni)

Parallel and Reconfigurable Computing

Department of Electrical and Computer EngineeringUniversity of Auckland

Computing for SKA, 2017

Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)FDAS on FPGA using OpenCL



What’s Next

Strong-field Test of Gravity using Pulsars

Image Credit: NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)

Image credit: NASA .




What’s Next

Outline

1 Overview and Task

2 FT Convolution Decomposition

3 High-level Techniques and Implementation

4 Evaluation

5 What’s Next




What’s Next

Outline

1 Overview and Task



4 Evaluation

5 What’s Next




What’s Next

Pulsar and Pulsar Search

Observed radiation is a pulse Binary pulsar (Doppler effect)Acceleration search: 1) Time-domain 2) Frequency-domain

.




What’s Next

Pulsar and Pulsar Search

Frequency-domainUsing matched filtering technique in Fourier domain torecover the signal into single bin.

Ar0 w[r0]+m/2

∑k=[r0]−m/2

AkA∗r0−k ,

where frequency r0 is unknown.Summation is computed at a range of frequencies r .




What’s Next

Block Overview of Pulsar Search Engine

Data Receptor(RCPT)

BeamformedData(BFD)

Dedispersion Buffer Creator

(DDBC)

FilterbankData Chuncks

(FDC) RFI Mtigation(RFIM)

DedispersionBuffers(DB) Dedispersion

Transform(DDTR)

Flagged DB(FDB) Periodicity

Search Buffer Creator (PSBC)

Dedispersed Data Buffer

(DDB)

Complex Fourier

Transform (CXFT)

Periodicity Search Buffer

(PSB)

Birdie Zapping(BRDZ)

Dereddening Spectrum(DRED)

Inverse Complex Fourier Transform

(iCXFT)

Single Pulse Detector (SPCT)

Dedispersed Data Buffer

(DDB)

Single Pulse Sifter

(SPSIFT)

Single Pulse Optimiser(SPOPT)

Candidate Data Output Streamer(CDOS)

To SDP

Fourier Domain Acceleration Search

(FDAS)

Time Domain Resampler Transform(TDRT)

Fourier Transform and Power

Spectrum (PWFT)

Harmonic Summing (HRMS)

Time Domain Candidate

Optimisation(TDAO)

Candidate Sifting(SIFT)

Full Filterbank Buffer Creator

(FFBC)

Candidate Folding and Optimsation

(FLDO)

Fourier Domain Candidate

Optimisation(FDAO)

Filterbank Data for Selected SP Candidates

Filterbank Data for Candidate

FoldingFrom SDP

Common

Single Pulse

Time‐domain Acc

Freq‐domain Acc




What’s Next

Fourier-domain Acceleration Search (FDAS)

FDAS module is applied to search for (binary) pulsars with constantfrequency derivatives in frequency-domain

Beam2

Beami

......

......

Over 2,000 beams are formed at 4,096

channels/beam

Beami signals are de-dispersed for 6,000 DMs

FIR_1

FIR_k... ...

FIR_85

Post-processing

PSS Engine_i

FT Convolution Module

BeamN

DM1DM1

DM2...

DMj

DM6000

...

85 FIR filters, maximum length is 421-tap

Pre-Processing

.RFIM .DDTR .PSBC .CXFT .BRDZ .DRED

· Single Pulse Search Modules · Time Domain Acceleration or

Harmonic-sum

Module

FDAS Module




What’s Next

Specification of Task

Parameter Destriiption ValueB # of beams 1000∼ 2000

DM # of de-dispersion measure (DM) trails 6000Tobs Observation period 540stlimit Time of executing one sample group 88msN # of complex samples per group 222

M # of templates/filter 85K # of average template/filter length > 200

.




What’s Next

Outline

1 Overview and Task



4 Evaluation

5 What’s Next




What’s Next

FT Convolution

Complex floating-point operationsMultiple long FIR filtersLarge input sizeStrict time limit

Number of acceleration devices (CapEx)

Energy consumption (OpEx)




What’s Next

Basic Element

Time-domain FIR Filter (TDFIR)

ym[i ] =K−1

∑k=0

xm[i −k]hm[k], for i = 0,1, ...N−1

Frequency-domain FIR Filter (FDFIR)

F{f ∗h}= F{f } ·F{h}




What’s Next

Hardware Limitation

Naïve Time Domain

DSP

block Single precision floating-point (SPF) multiplications(A+ iB)× (C + iD) = (A×C −B×D)+ i(A×D +B×C)

Naïve Frequency Domain

Off-chip (global) memory

Off-chip memory bandwidth

RAM block

On-chip (local) memory size4-Million elements = 32MBytes .




What’s Next

Decomposition Algorithms

Overlap-add Algorithm

Split the coefficient array–> OLA-TD

Overlap-save Algorithm

Split the input array–> OLS-FD

Coefficients C_1 C_2 C_N. . .

Input data Zero

Length = Ncoef /N -1

Output data_i

Output data_1

Output data

Output data_N

. . .

+Length = Ncoef /N

Split

Convolve with subset coefficient group i

Output data_2

PD_3PD_2

Input DataZero

Length =Ncoef -1

ID_1

ID_2

ID_3

ID_N...

ID_i PD_i

Convolution with FIR filter

Discard the Ncoef-1 elements

Output Data

PD_1 ... PD_N

Split the input into N small groups

(a) OLA (b) OLS




What’s Next

Outline

1 Overview and Task



4 Evaluation

5 What’s Next




What’s Next

High-level Techniques

Maxeler MaxCompiler using Java to develop FPGA (HPCC2016)

Open Computing Language (OpenCL) for FPGAs (Intel FPGA Cards),GPUs, and CPUs (FPT2016, best paper candidate)

DDR Controller & PHY

Global Memory Interconnect

Local Memory Interconnect

FPGA_0

PCIe

Block RAM

Kernel PipelineKernel PipelineKernel PipelineKernel PipelineKernels Pipeline

2GB DDR3 x 2 Memory

(Global Memory)

PCIe

Core 1

Core 4

Host

Memory(DDR3 and SSD)

...

...DDR Controller & PHY

Global Memory Interconnect

Local Memory Interconnect

FPGA_i

PCIe

Block RAM

Kernel PipelineKernel PipelineKernel PipelineKernel PipelineKernels Pipeline

...




What’s Next

Kernel Structures–OLA




What’s Next

Kernel Structures–OLS

Global Memory (Bank1)

Global Memory (Bank2)

ChannelsFFT Data Fetch

Kernel(NDRange)

FFT and MultiplicationKernel (Single)

ChannelsFFT Bit-Reverse

Kernel(NDRange)

Chan

nels

IFFT Data FetchKernel

(NDRange)

IFFTKernel(Single)

IFFT Bit-Reverse Kernel

(NDRange)

Channels Channels

Output

InputProcessed

coefficients

Output

Global Memory 1st launch: Bank1 2nd launch: Bank2

ChannelsData Fetch and

Multiplication Kernel(NDRange)

FFT/IFFT Kernel (Single)1st launch FFT

2nd launch IFFT

ChannelsBit-Reverse

Kernel(NDRange)

Global Memory 1st launch: Bank2 2nd launch: Bank11st launch

2nd launch

Processed coefficients

2nd launch

Input1st launch Switch

AOLS

TOLS

.Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)FDAS on FPGA using OpenCL



What’s Next

Outline

1 Overview and Task



4 Evaluation

5 What’s Next




What’s Next

Platform

Table: Details of FPGA and GPU Platforms

Device (Board) Terasic DE5-Net Sapphire Nitro R7 370

Hardware Intel Stratix V 5SGXA7 AMD Radeon R7 370

Technology 28nm 28nm

Compute resource622,000 LEs

1024 Stream Processors256 DSP blocks

On-chip memory size 50Mb —

Global memory size 2 x 2GB DDR3 3GB GDDR5

Global memory frequency 800MHz 5,600MHz

Memory interface width 2 x 64-bit 256-bit

Max clock frequency — 985MHz

OpenCL 1.0 1.2

Max power consumption — 150W




What’s Next

Latency–TDFIR vs FDFIR

4 TDFIR Kernels

Naïve

TD-Naïve-64STD-Naïve-64N

OLA

OLA-64SOLA-64N

5 FDFIR Kernels

Naïve

FD-Naïve

OLS

AOLS

AOLS-1024AOLS-2048AOLS-4096

TOLS

TOLS-1024




What’s Next

Latency–TDFIR vs FDFIR

Latencies of a single FPGA (Intel Stratix V A7) in processing sameinput array using 9 different OpenCL kernels:

●

●

●

●

64 128 256 421

0

50

100

150

200

250

FIR Filter Length

Ker

nel E

xecu

tion

Late

ncy

(ms)

● TD−Naïve−64STD−Naïve−64NOLA−64SOLA−64NFD−NaïveAOLS−1024AOLS−2048AOLS−4096TOLS−1024




What’s Next

Multiple FIR Filters

Even fastest kernel cannot meet time limit

=> Implement multiple FIR filters in parallel

Problem:Bandwidth of off-chip memory is main problemSolution:

Do more processing !Calculate power of complex values (need input in next stage)

Problem:Number of DSP blocks limits number of parallelisable filtersSolution:

Downscale the FFT engine input size: 8 points–> 4 points




What’s Next











What’s Next











What’s Next


1) Multiple FIR filters 2) Power of complex valuesBecomes optimisation problem!

TOLS-1024 8points

2 x AOLS-1024 8points

2 x AOLS-1024-P 8points

AOLS-1024-P 8points

AOLS-1024 8points

AOLS-1024-P 4points

AOLS-2048-P 4points



FFT Engine Element-wise multiplications PowerUnused DSP blocks




What’s Next

Multiple FIR Filters and FPGAs

0

200

400

600

3xAOLS−1024−P 3xAOLS−2048−P 3xAOLS−4096−POpenCL Kernels

late

ncy

of 8

4 F

IR F

ilter

s (m

s)

Device1 FPGA

2 FPGAs

3 FPGAs

0

100

200

300


Ker

nel P

efor

man

ce (

GF

LOP

S)

Device1 FPGA

2 FPGAs

3 FPGAs

0

1

2

3

4


Pow

er E

ffici

ency

(G

FLO

PS

/wat

t)

Device1 FPGA

2 FPGAs

3 FPGAs

0

5

10

15

20


Ene

rgy

Dis

sipa

tion

(Jou

le)

Device1 FPGA

2 FPGAs

3 FPGAs




What’s Next

Latency–FPGA vs GPU

Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs inprocessing 2 and 4 Million points:

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●

0 9 18 27 36 45 54 63 72 81

0

20

40

60

80

100

120

140

Total FIR Filters

Ker

nel E

xecu

tion

Late

ncy

(ms)

● 3xAOLS−2048−P (4−Million)3xAOLS−2048−P (2−Million)GPU−FD (4−Million)GPU−FD (2−Million)




What’s Next

Energy–FPGA vs GPU

Energy dissipation of single FPGA and GPU in executing the sametask with different kernels:

GPU−FD

3xAOLS−1024−P

3xAOLS−2048−P

3xAOLS−4096−P

AOLS−2048

TOLS−1024

0 5 10 15 20

Energy Dissipation (Joule)




What’s Next

Outline

1 Overview and Task



4 Evaluation

5 What’s Next




What’s Next

Harmonic-summing

Input: Filter-output-plane (FOP, 85×222 SPF points,~1.33GBytes)Processing flow:

8 harmonic planes are generated based on FOP and thestretched planesOne threshold for each row of each harmonic plane (overall:85×8= 680)Hundreds of candidates are recorded

Output: Candidates

each candidate contains the indexes of filter, harmonic, bin,and amplitude (up to 64-bit)




What’s Next

Harmonic-summing

Problems:Input data size is too large (~1.33GBytes)

On-chip memory size too small for all planes

The cost of computation task is very cheap (SPF adds) andeasy to parallelise

Off-chip memory bandwidth is issue

Challenge:Optimise data use (and reuse), computation not an issue




What’s Next

Conclusion

FPGA-based implementation and optimisation of FTconvolution (FIR filter), based on OLA and OLS algorithmsHigh-level approaches to such tasks works well

Covered large design spaceEasy porting and sharing with partners

With multiple FPGAs, FPGA implementation has advantageover GPU in both performance (GFLOPS) and Energyefficiency


Documents

HardwareAccelerationofPulsarSearchonFPGAs usingOpenCL · KKerenrenel lP iPpipeleinlinee K er nlsP ip 2GB DDR3 x 2 Memory (Global Memory) CIe Core 1 Cor4 Host Memory (DDR3 and SSD)