Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014
GTC 2017
Green Flash
Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal
Julien BERNARD
Green Flash
● Public and private actors
– Paris Observatory
– University of Durham
– Microgate
– PLDA
● Part of Horizon 2020: the EU Research and Innovation programme
● 3-year project
● €3.8 million budget
● Involves about 30 people
● Research axes
– Real-time HPC with accelerators and smart interconnects
– Energy-efficient platform based on FPGAs
– Real-Time Controller (RTC) prototype for the European Extremely Large Telescope Adaptive Optics (AO) system
Contributors
Maxime Lainé : software engineer
Denis Perret : FPGA expert
Arnaud Sevin : software lead
Damien Gratadour : project lead
Christophe Rouaud : PLDA project lead
Gaetan Dufourcq : QuickPlay expert
E-ELT: Adaptive Optics
● Compensate the wavefront perturbations in real time
● Using a wavefront sensor (WFS) to measure them
● Using a deformable mirror (DM) to reshape the wavefront
● Commands to the mirror must be computed in real time (~ms rate)
RTC concept for ELT AO
Real-Time Controller (RTC)
Legacy architecture (e.g. the SPARTA architecture)
– DSP & CPU
– VXS backplane
Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
Sphere     | 1   | 2.6k  | 1  | 1.3k | 1.5k      | 5.2
AOF        | 4   | 2.4k  | 1  | 1.2k | 1k        | 11.8
[Diagram: sensor and active elements connected through a switch]
Real-Time Controller
Cluster network architecture
[Diagram: RTC nodes 0 … N-1 on the cluster network]
Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
Sphere     | 1   | 2.6k  | 1  | 1.3k | 1.5k      | 5.2
AOF        | 4   | 2.4k  | 1  | 1.2k | 1k        | 11.8
ELT        | 6   | 80k   | 3  | 15k  | 500       | 1.2k
[Diagram: sensors 0–5 and active elements 0–2 connected to the RTC nodes through a switch]
Legacy GPU programming
[Diagram: 10GbE NIC → CPU + CPU RAM → PCIe → GPU + GPU RAM]
main {
  setup();
  while (run) {
    recv(…);
    cudaMemcpy(…, HostToDevice);
    computing_kernel<<<…>>>(…);
    cudaMemcpy(…, DeviceToHost);
    send(…);
  }
}
cudaMemcpy() overhead times (5.12 MB in, 64 KB out)
Kernel launch overhead times
Both cases: jitter of 20 to 30 µs (sometimes 40 µs)
This leaves too little time for computation
Improvement
GPU direct & I/O Memory mapping
Persistent Kernel
GPU direct & I/O Memory mapping
● FPGA writes/reads directly to/from GPU memory
● CPU is free for other kinds of computation
[Diagram: the FPGA NIC runs a UDP offload engine, a camera protocol handler and a DM protocol handler; pixels are DMAed over PCIe 3.0 straight into a pixels buffer in GPU RAM, the compute kernels write DM commands into a DM com buffer that is DMAed back to the FPGA, and the CPU app is left with camera/FPGA control and latency measurement (DMA start, measures, answers)]
FPGA Development platform
Eased development process using the QuickPlay tool from PLDA
● Single generic design / multiple target boards
– ExpressK-US board (hosting a Xilinx Kintex UltraScale)
– ExpressGX V board (hosting an Altera Stratix V)
– μXlink board from Microgate (hosting an Altera Arria 10)
Persistent Kernel
Classic implementation
Persistent kernel implementation
GPU direct, I/O Memory mapping & Persistent kernel
[Diagram: the 10GbE FPGA NIC writes over PCIe directly into GPU RAM and raises a start flag; CPU and CPU RAM are out of the data path]
main {
  setup();
  persistent_kernel<<<…>>>(…);
  …
}

persistent_kernel(…) {
  while (run) {
    pollMemory(…);
    computation(…);
    startDMATransfer(…);
  }
}
Pipelining I/O and compute
Test setup:
– FPGA: PLDA XPressG5
– GPU: Tesla C2070
– OS: Debian wheezy
– Camera: EVT HS-2000M on a 10GbE network
[Plot: latency (µs) over iterations, "No GPUDirect" vs. "GPUDirect + persistent kernel"]
SCAO Pyramid case: 240 × 240 pixels, encoded on 16 bits
DGX-1 benchmark
● FPGA is replaced by a CPU
● Each node master receives the frame data
● Work is shared between all devices
● RTC master sends back the final result
[Diagram: RTC master, node masters, slaves]
Result 1/2: Time and jitter
Histogram, 4-device case with 10,048 slopes × 15,000 commands
Average: 0.45 ms
Peak-to-peak jitter: 17 µs
Variation: 1.8%
[Histogram x-axis: time in ms]
Result 2/2: Sync & intercom time
Intercommunication time: average 15 µs, jitter 8.8 µs
Synchronization time: average 24 µs, jitter 12 µs
Conclusion & future work
● Conclusion
– Using GPUDirect and a persistent kernel allows efficient data delivery to the RTC
– Lower jitter
– Simpler execution stream
– QuickPlay tool from PLDA
● Eased the FPGA development cycle
● Mixes communication protocols and data processing in the same streams
● Expandable ecosystem, with QuickStore / QuickAlliance
● Future work
– Test on an AO bench (with DM and WFS)
– Use the multi-node architecture
– Test with fp16
Thank you
Questions?