Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014
GTC 2017
Green Flash
Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal
Julien BERNARD
Green Flash
● Public and private actors
– Paris Observatory
– University of Durham
– Microgate
– PLDA
● Part of Horizon 2020: the EU Research and Innovation programme
● 3-year project
● €3.8 million budget
● Involves about 30 people
● Research axes
– Real-time HPC with accelerators and smart interconnects
– Energy-efficient platform based on FPGAs
– Real-Time Controller (RTC) prototype for the European Extremely Large Telescope Adaptive Optics (AO) system
Contributors
Maxime Lainé : software engineer
Denis Perret : FPGA expert
Arnaud Sevin : software lead
Damien Gratadour : project lead
Christophe Rouaud : PLDA project lead
Gaetan Dufourcq : QuickPlay expert
E-ELT: Adaptive Optics
● Compensate the wavefront perturbations in real time
● Using a wavefront sensor (WFS) to measure them
● Using a deformable mirror (DM) to reshape the wavefront
● Commands to the mirror must be computed in real time (~ms rate)
RTC concept for ELT AO
Real-Time Controller (RTC)
Legacy architecture (e.g. the SPARTA architecture)
– DSP & CPU
– VXS backplane
Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
Sphere     | 1   | 2.6k  | 1  | 1.3k | 1.5k      | 5.2
AOF        | 4   | 2.4k  | 1  | 1.2k | 1k        | 11.8
[Diagram: sensor and active elements connected through a switch]
Real-Time Controller
Cluster network architecture
[Diagram: RTC nodes 0 … N-1 on the cluster network]
Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
Sphere     | 1   | 2.6k  | 1  | 1.3k | 1.5k      | 5.2
AOF        | 4   | 2.4k  | 1  | 1.2k | 1k        | 11.8
ELT        | 6   | 80k   | 3  | 15k  | 500       | 1.2k
[Diagram: sensors 0–5 and active elements 0–2 connected to the RTC nodes through a switch]
Legacy GPU programming
[Diagram: 10GbE NIC → CPU + CPU RAM → PCIe → GPU + GPU RAM]
main {
  setup();
  while (run) {
    recv(…);
    cudaMemcpy(…, HostToDevice);
    computing_kernel<<<…>>>(…);
    cudaMemcpy(…, DeviceToHost);
    send(…);
  }
}
cudaMemcpy() overhead times (5.12 MB in, 64 KB out)
Kernel launch overhead times
Both cases: jitter of 20 to 30 µs (sometimes 40 µs)
This leaves too little time for computation
Improvement
GPU direct & I/O Memory mapping
Persistent Kernel
GPU direct & I/O Memory mapping
● FPGA writes/reads directly to/from GPU memory
● CPU is free for other kinds of computation
[Diagram: the FPGA NIC runs a UDP offload engine, a camera protocol handler and a DM protocol handler; pixels are DMAed over PCIe 3.0 straight into a pixels buffer in GPU RAM, the compute kernels write DM commands into a DM com buffer that is DMAed back to the FPGA, and the CPU app is left with camera/FPGA control and latency measurement (DMA start, measures, answers)]
FPGA Development platform
Eased development process using the QuickPlay tool from PLDA
● Single generic design / multiple target boards
– ExpressK-US board (hosting a Xilinx Kintex UltraScale)
– ExpressGX V board (hosting an Altera Stratix V)
– μXlink board from Microgate (hosting an Altera Arria 10)
Persistent Kernel
Classic implementation
Persistent kernel implementation
GPU direct, I/O Memory mapping & Persistent kernel
[Diagram: the 10GbE FPGA NIC writes over PCIe directly into GPU RAM and raises a start flag; CPU and CPU RAM are out of the data path]
main {
  setup();
  persistent_kernel<<<…>>>(…);
  …
}

persistent_kernel(…) {
  while (run) {
    pollMemory(…);
    computation(…);
    startDMATransfer(…);
  }
}
Pipelining I/O and compute
Test setup:
– FPGA: PLDA XPressG5
– GPU: Tesla C2070
– OS: Debian wheezy
– Camera: EVT HS-2000M on a 10GbE network
[Plot: latency (µs) over iterations, "No GPUDirect" vs. "GPUDirect + persistent kernel"]
SCAO Pyramid case: 240 × 240 pixels, encoded on 16 bits
DGX-1 benchmark
● FPGA is replaced by a CPU
● Each node master receives the frame data
● Work is shared between all devices
● RTC master sends back the final result
[Diagram: RTC master, node masters, slaves]
Result 1/2: Time and jitter
Histogram, 4-device case with 10,048 slopes × 15,000 commands
Average: 0.45 ms
Peak-to-peak jitter: 17 µs
Variation: 1.8%
[Histogram x-axis: time in ms]
Result 2/2: Sync & intercom time
Intercommunication time: average 15 µs, jitter 8.8 µs
Synchronization time: average 24 µs, jitter 12 µs
Conclusion & future work
● Conclusion
– Using GPUDirect and a persistent kernel allows efficient data delivery to the RTC
– Lower jitter
– Simpler execution stream
– QuickPlay tool from PLDA
● Eased the FPGA development cycle
● Mixes communication protocols and data processing in the same streams
● Expandable ecosystem, with QuickStore / QuickAlliance
● Future work
– Test on an AO bench (with DM and WFS)
– Use the multi-node architecture
– Test with fp16
Thank you
Questions?