
Powering Real-time Radio Astronomy Signal Processing with GPUs

Design of a GPU based real-time backend for the upgraded GMRT (GTC 2013)

Harshavardhan Reddy Suda

NCRA, India

Pradeep Kumar Gupta

NVIDIA, India

Collaborating teams :

NCRA, Pune, India :

• Yashwant Gupta

• B. Ajithkumar Nair

• Harshavardhan Reddy Suda

• Sanjay Kudale

Swinburne University, Australia :

• Andrew Jameson

• Ben Barsdell

NVIDIA, India :

• Pradeep Kumar Gupta

Acknowledgements :

Matthew Bailes (Swinburne)

Jayanta Roy (NCRA)

Amit Bansod (IIT Bombay)

Dharam Vir Lal (NCRA)

Outline :

Introduction

• Overview of the GMRT

• GMRT receiver system

Digital Back-end for the GMRT

• Existing Digital Back-end

• Upgrade specifications and compute requirements

Upgraded Digital Back-end

• Design and Development

• GPU and IO performance

• Results

Future Prospects

Introducing the GMRT

The Giant Meter-wave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies (50 to 1450 MHz)

Located 80 km north of Pune, 160 km east of Mumbai

Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths – the largest in the world at these frequencies

Operational since 2001

Frequency range :
• 130-170 MHz
• 225-245 MHz
• 300-360 MHz
• 580-660 MHz
• 1000-1450 MHz

Effective collecting area (2-3% of SKA) :
• 30,000 sq m at lower frequencies
• 20,000 sq m at highest frequencies

Supports 2 modes of operation :

• Interferometry, aperture synthesis

• Array mode (incoherent & coherent)

GMRT is used by astronomers from all over the world for various kinds of astrophysical studies

The GMRT : A Quick Overview

[Figure : Google eye view of the array – a central 1 km x 1 km compact region, with arms extending the array over ~14 km]

GMRT Receiver System

Dual polarized feeds

Super-heterodyne receiver chain : IF & baseband sections

Tunable LO (30 – 1700 MHz)

Maximum IF bandwidth : 32 MHz

Digital Back-end : correlator (for imaging) + beamformer (for pulsar studies)

Digital Back-end

[Block diagram : signal chain of the digital back-end]
• Each antenna signal (Ant 1 ... Ant M) is digitized by an ADC
• Delay correction to integer clock cycles is applied in the time domain
• An FFT converts each signal to the frequency domain
• Phase correction is applied per frequency channel (see the kernel sketch below)
• A Multiply and Accumulate (MAC) stage forms the correlations, alongside a beamformer
• The output goes to data storage and analysis
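As an illustration of the per-channel phase-correction step, here is a minimal CUDA kernel sketch (hypothetical names and data layout, not the actual GMRT code) : it rotates each spectral channel by a precomputed per-antenna, per-channel phase, which is the usual way a fractional-delay or fringe correction is applied in the frequency domain.

#include <cuComplex.h>

// Rotate every frequency channel of every antenna spectrum by a precomputed
// phase (in radians). Assumed layout: spectra[ant * nchan + chan].
__global__ void phase_correct(cuComplex *spectra, const float *phase,
                              int nant, int nchan)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (antenna, channel)
    if (idx >= nant * nchan) return;

    float s, c;
    sincosf(phase[idx], &s, &c);                       // e^(j*phase)
    spectra[idx] = cuCmulf(spectra[idx], make_cuComplex(c, s));
}

// Launch sketch (illustrative sizes):
//   int n = nant * nchan;
//   phase_correct<<<(n + 255) / 256, 256>>>(d_spectra, d_phase, nant, nchan);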

Existing Digital Back-end : The GMRT Software Backend (Roy et al 2010)

Software based back-ends :
• Few made-to-order hardware components ; mostly off-the-shelf items
• Easier to program ; more flexible

The GMRT Software Back-end (GSB) :
• 32 antennas
• 32 MHz bandwidth, dual polarisation
• Net input data rate : 2 Gsamples/sec
• FX correlator + beamformer
• Uses off-the-shelf ADC cards, CPUs & switches to implement a fully real-time back-end
• Current status : now working as the observatory back-end

Looking ahead : The GMRT upgrade

Seamless frequency coverage from ~ 30 MHz to 1500 MHz

Increased instantaneous bandwidth of 400 MHz (from the present maximum of 32 MHz), which requires a modern new digital back-end receiver

Compute requirements (signal chain : Sampler → Fourier Transform, O(N log N) → Phase Correction → MAC, M(M+1)/2 ; antenna signals M = 64) :

                     GSB (32 MHz BW)              Upgraded digital back-end (400 MHz BW)
Fourier Transform    2k point FFT – 181 GFlops    16k point FFT – 2.9 TFlops
Phase Correction     8.5 GFlops                   0.1 TFlops
MAC                  560 GFlops                   6.6 TFlops
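For context, the MAC figures above can be roughly reproduced with the usual back-of-the-envelope estimate, assuming 8 real floating-point operations per complex multiply-accumulate and M = 64 antenna signals (this reconstruction is not from the slides) :

R_MAC ≈ 8 × M(M+1)/2 × B = 8 × 2080 × B

GSB (B = 32 MHz) : 8 × 2080 × 32×10^6 ≈ 0.53 TFlops (quoted : 560 GFlops)
Upgrade (B = 400 MHz) : 8 × 2080 × 400×10^6 ≈ 6.7 TFlops (quoted : 6.6 TFlops)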

Why GPUs?

Total computation requirement for GSB ~750 GFlops

For upgraded digital backend ~10 TFlops

Going by the peak single-precision floating-point performance of the Fermi C2050 and the Kepler K20, ten C2050s or three K20s should be enough, provided the IO requirement can be handled.

IO requirement for upgraded digital back-end ~25 GB/s
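As a quick sanity check of the counts above (assuming the commonly quoted peak single-precision figures of roughly 1.03 TFlops for the Tesla C2050 and 3.5 TFlops for the Tesla K20, which are not stated on the slides) : 10 × 1.03 ≈ 10.3 TFlops and 3 × 3.5 ≈ 10.5 TFlops, both just above the ~10 TFlops requirement. These counts assume peak throughput ; the sustained-performance benchmarks later in the talk are what set the actual GPU count.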

Upgraded Digital Back-end design

[Block diagram : per-antenna chain, replicated for Antenna 1 ... Antenna 32]
• Each antenna (400 MHz, 2 pols) feeds a 2-channel ADC
• Each ADC feeds an FPGA packetizer
• Each FPGA packetizer sends its data to a CPU+GPU node (correlator)
• An Infiniband switch interconnects the CPU+GPU nodes, together with data acquisition and control

Upgraded Digital Back-end development

• Xilinx Virtex-5 FPGA boards with ADC cards connected
• CPU hosts : DELL T7500 ; Myricom 10GbE NICs ; Infiniband interconnect using an 8-port Mellanox switch
• GPU cards : Tesla C2050 and K20
• Prototype 8-antenna system

Work done in collaboration with NVIDIA, India and Swinburne University, Australia

Benchmarking results (single Fermi C2050)

Operation          Performance (GFLOPS)
FFT (2k point)     330
FFT (4k point)     322
FFT (8k point)     233
Phase shifting     167
MAC                340

Overall sustained performance with the Fermi C2050 is nearly 1/3rd of its peak single-precision floating-point performance

Total number of Fermi C2050s required for the full correlator ~ 30

Kepler K20 benchmarking – 33% improvement over Fermi C2050 ; the number of GPUs required reduces to approximately twenty

Optimizing the code for K20 will further reduce the number of GPUs required
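As an illustration of how such an FFT throughput figure can be obtained, here is a minimal batched CUFFT timing sketch (illustrative sizes ; the common 5·N·log2(N) flop-count convention is an assumption, and this is not the actual benchmark code) :

#include <cufft.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

int main()
{
    const int N = 2048;          // 2k-point complex FFT
    const int batch = 4096;      // number of FFTs executed per call

    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, batch);

    // Time one batched transform with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Assumed convention: 5 * N * log2(N) flops per complex FFT.
    double flops = 5.0 * N * std::log2((double)N) * batch;
    std::printf("~%.0f GFLOPS\n", flops / (ms * 1e-3) / 1e9);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}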


I/O consideration and MPI performance

From each FPGA board to its GPU host the data rate is a constant 800 MBps (200 MHz BW at 8 bits/sample, or 400 MHz BW at 4 bits/sample)

Between GPU hosts, the fraction of data each node has to share with the other nodes increases with the number of nodes M as (M-1)/M, since each node keeps only 1/M of the data it acquires and exchanges the rest (see the MPI sketch below)

Bi-directional bandwidth achieved on the 10 GbE interconnect between four nodes – 1.3 GBps ; this is not sufficient for more than four nodes

Benchmarks on a GPU cluster with Mellanox Infiniband interconnect give ~5 GBps ; for a 32-node cluster, memory transfer takes 32% of the time
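A minimal sketch of this all-to-all "corner turn" between the GPU hosts, assuming each node splits its locally acquired buffer into M equal blocks and exchanges them with MPI_Alltoall (hypothetical buffer names and an illustrative block size ; the real system overlaps this exchange with acquisition and GPU processing) :

#include <mpi.h>
#include <vector>

// Each of the M nodes holds block * M bytes of locally acquired data, split
// into M equal blocks (one destined for every node, including itself).
// After MPI_Alltoall each node keeps 1/M of its own data and has received
// the (M-1)/M remainder of its working set from the other nodes.
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int M = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &M);

    const int block = 1 << 20;   // illustrative 1 MB per destination node
    std::vector<char> sendbuf(static_cast<size_t>(block) * M);
    std::vector<char> recvbuf(static_cast<size_t>(block) * M);

    MPI_Alltoall(sendbuf.data(), block, MPI_BYTE,
                 recvbuf.data(), block, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}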

Data I/O considerations

[Chart : percentage of data shared vs number of nodes, following (M-1)/M]
• 2 nodes : 50%
• 4 nodes : 75%
• 8 nodes : 87.5%
• 16 nodes : 93.75%
• 32 nodes : 96.875%
(A companion curve shows number of nodes vs % of time spent on data sharing.)

Software Model : OpenMP, MPI and CUDA

The code uses OpenMP ; parallel processing is done with multiple threads. The main thread splits into a data-acquisition thread (shared-memory read, then MPI transfer) and a data-processing thread (correlation, then visibility dump), as sketched below.
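A minimal OpenMP sketch of this two-way split, with hypothetical stub functions standing in for the real pipeline stages (not the production code) :

#include <omp.h>

// Hypothetical stand-ins for the real pipeline stages.
void read_shared_memory() { /* read raw voltages from shared memory */ }
void mpi_transfer()       { /* corner-turn exchange with the other nodes */ }
void correlate_on_gpu()   { /* FFT, phase correction and MAC on the GPU */ }
void dump_visibilities()  { /* write out the accumulated visibilities */ }

int main()
{
    // The main thread forks two workers: data acquisition and data processing.
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            read_shared_memory();
            mpi_transfer();
        }
        #pragma omp section
        {
            correlate_on_gpu();
            dump_visibilities();
        }
    }
    return 0;
}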

GPU features used

- Asynchronous data transfer with streams
- Pinned memory to achieve high H2D transfer bandwidth
- Shared memory to enhance MAC
- CUFFT library
(A minimal sketch combining streams, pinned memory and CUFFT follows below.)
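A minimal sketch of how pinned host memory, an asynchronous host-to-device copy on a CUDA stream and a CUFFT plan bound to the same stream fit together (illustrative sizes and names, not the production correlator code) :

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 2048, batch = 1024;
    const size_t bytes = sizeof(cufftComplex) * N * batch;

    // Pinned (page-locked) host buffer: needed for fast, truly asynchronous H2D copies.
    cufftComplex *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);
    cudaMalloc(&d_buf, bytes);

    // One stream carries both the copy and the FFT, so they queue back to back
    // and can overlap with work issued on other streams.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, batch);
    cufftSetStream(plan, stream);            // bind the FFT to the same stream

    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);   // in-place batched FFT

    cudaStreamSynchronize(stream);           // wait for this block of work to finish

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}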

First light image from the GPU correlator

Image of 3C147 made from 4 hrs of observations with 8 antenna inputs (single polarisation)
RF : 1280 MHz ; BW : 30 MHz
RMS noise : 7 mJy

Sample result from the new wideband signal path

First GMRT image using 100 MHz RF BW at L-band
RMS noise : 3 mJy

Courtesy Sanjay Kudale and Dharam Vir Lal

Future Plans

Build a 16-node cluster for 30 antennas of the GMRT, with each node having two 10GbE cards for data acquisition, latest-generation Kepler GPU cards for processing and an Infiniband network for inter-node data transfer

Implement RFI filtering and Digital Down Conversion schemes

Tune the code and do performance benchmarking (DevTech support from NVIDIA)

Explore GPU features :
• GPUDirect RDMA
• Increased resources in Kepler

Proposed Plan : 400 MHz BW, dual pol, 32 antennas

[Block diagram : 32 ROACH boards (ROACH 1 ... ROACH 32) and 16 CPU-GPU nodes (Node1 ... Node16) connected through a 40 Gbps INFINIBAND SWITCH, with individual links labelled ~800 MBps and a server machine for control]

That’s all !

Thank You