S6357 Towards Efficient Communication Methods and Models for
Scalable GPU-Centric Computing Systems
Holger Fröning, Computer Engineering Group
Ruprecht-Karls University of Heidelberg
GPU Technology Conference 2016
About us
• JProf. Dr. Holger Fröning
  • PI of Computer Engineering Group, ZITI, Ruprecht-Karls University of Heidelberg
  • http://www.ziti.uni-heidelberg.de/compeng
• Research: application-specific computing under hard power and energy constraints (HW/SW), future emerging technologies
  • High-performance computing (traditional and emerging)
  • GPU computing (heterogeneity & massive concurrency)
  • High-performance analytics (scalable graph computations)
  • Emerging technologies (approximate computing and stacked memory)
  • Reconfigurable logic (digital design and high-level synthesis)
• Current collaborations
  • Nvidia Research, US; University of Castilla-La Mancha, Albacete, Spain; CERN, Switzerland; SAP, Germany; Georgia Institute of Technology, US; Technical University of Valencia, Spain; TU Graz, Austria; various companies
The Problem
• GPUs are powerful high-core-count devices, but only for in-core computations
• Many workloads cannot be satisfied by a single GPU
  • Technical computing, graph computations, data warehousing, molecular dynamics, quantum chemistry, particle physics, deep learning, spiking neural networks
• => Multi-GPU, at node level and at cluster level
  • Hybrid programming models
• While single GPUs are rather simple to program, interactions between multiple GPUs dramatically push complexity!
• This talk: how good are GPUs at sourcing/sinking network traffic, how should one orchestrate communication, and what do we need for best performance/energy efficiency?
Review: Messaging-based Communication
• Usually Send/Receive or Put/Get
  • MPI as de-facto standard
• Work request descriptors
  • Issued to the network device
  • Target node, source pointer, length, tag, communication method, ...
  • Irregular accesses, little concurrency
• Memory registration
  • OS & driver interactions
• Consistency by polling on completion notifications (a minimal host-side sketch follows below)
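To make this CPU-centric flow concrete, a minimal host-side sketch using InfiniBand verbs; it is an illustration only and assumes an already created queue pair qp, completion queue cq, and a memory region mr registered for buf.

#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_sge sge = {
    .addr   = (uintptr_t)buf,        // source pointer (registered memory)
    .length = len,                   // message length
    .lkey   = mr->lkey,              // key obtained from memory registration
};
struct ibv_send_wr wr = {
    .wr_id      = 42,                // descriptor identifier
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,       // send/receive-style work request
    .send_flags = IBV_SEND_SIGNALED, // request a completion notification
};
struct ibv_send_wr *bad_wr;
ibv_post_send(qp, &wr, &bad_wr);     // issue the descriptor to the network device

struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0) // consistency by polling on completions
    ;

Constructing such descriptors, registering memory and spinning on the completion queue are exactly the steps that map poorly onto thousands of GPU threads.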
Lena Oden, Holger Fröning, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, International Journal of High Performance Computing Applications, Special Issue on Applications for the Heterogeneous Computing Era, Sage Publications, 2015.
Beyond CPU-centric communication
[Figure: source and target node, each with a GPU, a CPU and a NIC attached to the PCIe root, plus GPU memory and host memory. Plot annotations: start-up latency of 1.5 usec for the CPU-controlled path vs. 15 usec for GPU-controlled Put/Get (IBVERBS); a further annotation marks a 100x gap.]
“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges, 02/10/2014
GPUs are rather incompatible with messaging:
• Constructing descriptors (work requests)
• Registering memory
• Polling
• Controlling networking devices
Communication orchestration - how to source and sink network traffic
Example application: 3D stencil code
• Himeno 3D stencil code
  • Solving a Poisson equation using 2D CTAs (marching planes)
  • Multiple iterations using iterative kernel launches
• Multi-GPU: inter-block and inter-GPU dependencies
• Dependencies => communication
  • Inter-block: device synchronization required among adjacent CTAs
  • Inter-GPU: all CTAs participate in communication (sourcing and sinking) => device synchronization required (see the kernel sketch below)
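As an illustration (not the original Himeno implementation), a simplified 7-point stencil kernel shows where the inter-GPU dependencies come from: each GPU owns a slab of z-planes and must refresh its boundary planes by communication after every iteration. All names below (stencil_step, the halo layout) are hypothetical.

// One Jacobi-like stencil iteration on a GPU's slab (x fastest, z split across GPUs).
__global__ void stencil_step(const float *in, float *out, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D CTAs in the x/y plane
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;

    for (int z = 1; z < nz - 1; ++z) {               // "marching planes" along z
        size_t i = ((size_t)z * ny + y) * nx + x;
        out[i] = (in[i - 1] + in[i + 1] +
                  in[i - nx] + in[i + nx] +
                  in[i - (size_t)nx * ny] + in[i + (size_t)nx * ny]) / 6.0f;
    }
    // The planes z = 0 and z = nz - 1 are halos owned by the neighboring GPUs;
    // they must be exchanged before the next iteration can start.
}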
[Figures: control flow using CPU-controlled communication; 3D Himeno stencil code with 2D CTAs]
Different forms of communication control for an example stencil code
[Figures: control flow using in-kernel synchronization; control flow using stream synchronization (with/without nested parallelism)]
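For orientation, a hedged host-side sketch of the CPU-controlled baseline the figures compare against: the CPU launches the compute kernel each iteration and performs the halo exchange itself in between. The stencil_step kernel refers to the hypothetical sketch above; exchange_halos_via_host is likewise an assumed placeholder, not the original code.

// CPU-controlled orchestration: the CPU drives both compute and communication.
for (int it = 0; it < iterations; ++it) {
    stencil_step<<<grid, block, 0, stream>>>(d_in, d_out, nx, ny, nz);
    cudaStreamSynchronize(stream);                  // wait for the compute kernel
    exchange_halos_via_host(d_out);                 // e.g. cudaMemcpy D2H, MPI exchange, cudaMemcpy H2D
    float *tmp = d_in; d_in = d_out; d_out = tmp;   // swap buffers for the next iteration
}
// Stream synchronization instead enqueues GPU-side communication kernels into the same
// stream; in-kernel synchronization keeps all iterations inside one persistent kernel.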
Performance comparison - execution time
0"0,1"0,2"0,3"0,4"0,5"0,6"0,7"0,8"0,9"1"
256x256x256"
256x256x512"
256x256x1024"
512x512x256"
512x512x512"
512x512x640"
640x640x128"
640x640x256"
640x640x386"
Perfom
ance**
Rela-v
e*to*hybrid
*app
roach*
in0kernel0sync" stream0sync" device0sync"
• CPU-controlled still fastest
  • Backed up by previous experiments
• In-kernel synchronization slowest
  • Communication overhead increases with problem size: more CTAs, more device synchronization
  • ~28% of all instructions have to be replayed, likely due to serialization (use of atomics)
• Stream synchronization a good option
  • Difference to device synchronization is the overhead of nested parallelism
• Device synchronization most flexible regarding control flow
  • Communication as device function or as independent kernel (see the nested-parallelism sketch below)
  • Flexibility in kernel launch configuration
Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.
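A minimal sketch of what "communication as a device function vs. an independent kernel launched via nested parallelism" can look like with CUDA dynamic parallelism; comm_kernel, master_kernel and their arguments are hypothetical placeholders, not the code evaluated in the paper.

// Assumed communication kernel: threads cooperatively push the halo buffer to the
// remote GPU (e.g. via GPU-controlled remote stores); the body is a placeholder.
__global__ void comm_kernel(float *halo_buf, int count) { /* ... */ }

// Master kernel: compute this iteration, then hand off communication.
__global__ void master_kernel(float *d_data, float *halo_buf, int count)
{
    // ... compute phase ...

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Option A: nested parallelism - launch an independent communication kernel.
        comm_kernel<<<1, 256>>>(halo_buf, count);
        cudaDeviceSynchronize();   // device-side wait (dynamic parallelism, sm_35+, -rdc=true)
    }
    // Option B: call the same communication code as a plain __device__ function instead.
}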
Performance comparison - energy consumption
0"
0,2"
0,4"
0,6"
0,8"
1"
1,2"
256x256x256"
256x256x512"
256x256x1024"
512x512x256"
512x512x512"
512x512x640"
640x640x128"
640x640x256"
640x640x386"
Energy'con
sump.
on'
Rela.v
e'to'hybrid
'app
roach'
stream2sync" device2sync" in2kernel2sync"
• Benefits for stream/device synchronization as the CPU is put into sleep mode
  • 10% less energy consumption
  • CPU: 20-25 W saved
• In-kernel synchronization saves much more total power, but the increase in execution time results in higher energy consumption
  • Likely bad GPU utilization
Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.
Communication orchestration - Takeaways
• CPU-controlled communication is still fastest - independent of different orchestration optimizations
• GPU-controlled communication: intra-GPU synchronization between the individual CTAs is most important for performance
  • Stream synchronization most promising
  • Otherwise replay overhead due to serialization
• Dedicated communication kernels or functions are highly recommended
  • Either device functions in a master kernel (nested parallelism), or communication kernels in the same stream (issued by the CPU)
• Bypassing CPUs has substantial energy advantages
  • Decrease polling rates (snippet below), or use interrupt-based CUDA events (sketched after it)!
• More room for optimizations left
while (cudaStreamQuery(stream) == cudaErrorNotReady)
    usleep(sleeptime);   // back off between polls instead of busy-waiting
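A hedged sketch of the interrupt-based alternative: an event created with the cudaEventBlockingSync flag lets the waiting host thread sleep until the recorded GPU work has finished, instead of spinning on the stream. Variable names are illustrative.

cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventBlockingSync);  // blocking (interrupt-style) synchronization
cudaEventRecord(done, stream);                           // mark the end of the enqueued work
cudaEventSynchronize(done);                              // host thread sleeps until completion
cudaEventDestroy(done);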
GGAS: Fast GPU-controlled traffic sourcing and sinking
GGAS - Global GPU Address Spaces
• Forwarding load/store operations to global addresses
  • Address translation and target identification
  • Special hardware support required (NIC)
• Severe limitations for full coherence and strong consistency
  • Well known from CPU-based distributed shared memory
  • Reverting to highly relaxed consistency models can be a solution
Holger Fröning and Heiner Litz, Efficient Hardware Support for the Partitioned Global Address Space, 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), April 19, 2010, Atlanta, Georgia.
GGAS - thread-collaborative BSP-like communication
Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.
[Figure: BSP-like timeline on GPU 0 and GPU 1 (remote): computation, then communication using collective remote stores, a global barrier, then continue with the next computation phase]
GGAS - current programming model using mailboxes
<snip>
...
remMailbox[getProcess(index)][tid] = data[tid];  // collective remote store into the target's mailbox
__threadfence_system();                          // memory fence: make the data visible first
remoteEndFlag[getProcess(index)][0] = 1;         // then raise the completion flag for the receiver
__ggas_barrier();                                // global barrier across participating GPUs
...
<snip>
GGAS Prototype
Remote load latency (Virtex-6): 1.44-1.9 usec (CPU/GPU)
[Figure: node #0 (source) issues loads/stores; the source-local address is translated (target node determination, address calculation) into a global address; packets are forwarded loss-lessly and in order; at node #1 (target) the global address is translated into a target-local address in host memory (source tag management, address calculation) and the return route is set up]
• FPGA-based network prototype
  • Xilinx Virtex-6
  • 64-bit data paths, 156 MHz = 1.248 GB/s (theoretical peak)
  • PCIe Gen1/Gen2
  • 4 network links (torus topology)
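A hedged sketch of the source-side translation step shown in the figure, assuming a simple global address layout with a 16-bit target node ID in the upper bits and the target-local offset in the lower bits; the actual encoding of the prototype may differ.

#include <stdint.h>

// Assumed global address layout: [16-bit node ID | 48-bit local offset].
#define NODE_SHIFT 48
#define OFFSET_MASK ((1ULL << NODE_SHIFT) - 1)

// Source-side translation: determine the target node and its local offset.
static inline void translate_global(uint64_t global_addr,
                                    uint16_t *target_node,
                                    uint64_t *target_local_offset)
{
    *target_node         = (uint16_t)(global_addr >> NODE_SHIFT);
    *target_local_offset = global_addr & OFFSET_MASK;
    // The NIC forwards the load/store to *target_node, where the offset is mapped
    // back into local memory (plus source tag management for the return route).
}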
GGAS - Microbenchmarking
• GPU-to-GPU streaming
  • Prototype system consisting of Nvidia K20c & dual Intel Xeon E5
  • Relative results applicable to technology-related performance improvements
• MPI
  • CPU-controlled: D2H, MPI send/recv, H2D (see the staging sketch below)
• GGAS
  • GPU-controlled: GDDR to GDDR, remote stores
• RMA: Remote Memory Access
  • Put/Get-based, CPU-to-CPU (host) resp. GPU-to-GPU (direct)
[Plot annotations: GGAS latency starting at 1.9 usec; P2P PCIe issue]
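A minimal sketch of the CPU-controlled MPI path listed above (device-to-host staging copy, host-side MPI transfer, host-to-device copy); buffer names and the neighbor rank are illustrative, not the benchmark code.

// CPU-controlled staging: GPU memory -> host -> network -> host -> GPU memory.
cudaMemcpy(h_sendbuf, d_sendbuf, bytes, cudaMemcpyDeviceToHost);   // D2H
MPI_Sendrecv(h_sendbuf, bytes, MPI_BYTE, neighbor, 0,
             h_recvbuf, bytes, MPI_BYTE, neighbor, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);                   // host-side MPI
cudaMemcpy(d_recvbuf, h_recvbuf, bytes, cudaMemcpyHostToDevice);   // H2D
// GGAS instead lets GPU threads store directly into remote GPU memory (GDDR to GDDR).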
Allreduce - Power and Energy analysis
Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2014), May 26-29, 2014, Chicago, IL, US.
For this case: 50% of the energy saved
[Charts: GGAS vs. MPI]
Analyzing Communication Models for Thread-parallel Processors
Communication Models
• MPI
  • CPU-controlled
  • De-facto standard, widely used, heavily optimized (for CPUs)
• GGAS (using mailboxes with send/receive semantics)
  • GPU-controlled
  • Communication by forwarding load/store operations using global address spaces
  • Completely in line with the GPU execution model => highly thread-parallel
  • Main drawback is reduced overlap
• RMA: Remote Memory Access
  • GPU-controlled
  • Put/Get operations of the custom interconnect
  • Communication engine designed for HPC
  • GPUs have to construct/interpret descriptors (which was very crucial for the IBVERBS experiment)
Completing GGAS with Put/Get operations
• Descriptor-based Put/Get operations
  • Completely asynchronous
• Work request with
  • Type
  • Local/remote pointers
  • Credentials
  • Notification requests
• Notification
  • Completion information with reference to the work request
• Key is a simple descriptor format (reconstructed field layout and an illustrative struct below)
[Custom descriptor format, 6 doublewords (32 bit each):
• Read Address[63:0] (64 bit)
• Write Address[63:0] (64 bit)
• Destination Node (16 bit), Destination VPID (8 bit)
• Payload Size in bytes (23 bit)
• CMD (4 bit), NOTI (3 bit), D.Len (2 bit)
• Flag bits ERA, NTR, MC, EWA; remaining bits reserved (RSV)]
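For illustration only, the reconstructed fields could be packed into a C struct as sketched below; field order, the flag encoding and the exact bit packing are assumptions, not the authoritative hardware layout.

#include <stdint.h>

// Illustrative encoding of the custom Put/Get work request (6 doublewords, 24 bytes).
typedef struct __attribute__((packed)) {
    uint64_t read_addr;          // source of the transfer (64 bit)
    uint64_t write_addr;         // destination of the transfer (64 bit)
    uint32_t payload_size : 23;  // payload size in bytes
    uint32_t cmd          : 4;   // command, e.g. put or get
    uint32_t noti         : 3;   // notification requests
    uint32_t dlen         : 2;   // descriptor length
    uint16_t dst_node;           // destination node (16 bit)
    uint8_t  dst_vpid;           // destination VPID (8 bit), i.e. the credentials
    uint8_t  flags;              // ERA, NTR, MC, EWA and reserved bits
} put_get_descriptor_t;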
Put/Get: Different Work Request Queue Implementations
• Descriptor format and queue organization matter
  • Explicit/implicit trigger, conversion effort, descriptor complexity
• FPGA implementation: we could even change the descriptor format (a generic queue sketch follows below)
[Figures: IBVERBS descriptor format vs. custom descriptor format]
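A hedged sketch of one possible work request queue organization: descriptors sit in a ring buffer in memory and the device is triggered explicitly by writing the new tail index to a doorbell register. All names and the doorbell mechanism are generic illustrations, not the actual FPGA implementation.

#include <stdint.h>

#define QUEUE_DEPTH 64                            // assumed power-of-two ring size

typedef struct {
    put_get_descriptor_t slots[QUEUE_DEPTH];      // descriptor ring (see struct sketch above)
    volatile uint32_t   *doorbell;                // device register holding the tail index
    uint32_t             tail;                    // next free slot on the producer side
} wr_queue_t;

// Post a work request: write the descriptor, then trigger the device explicitly.
static void post_work_request(wr_queue_t *q, const put_get_descriptor_t *wr)
{
    q->slots[q->tail % QUEUE_DEPTH] = *wr;        // copy the descriptor into the ring
    __sync_synchronize();                         // make it visible before the trigger
    q->tail++;
    *q->doorbell = q->tail;                       // explicit trigger ("doorbell" write)
}

An implicit trigger would instead let the device detect descriptor writes directly, trading the doorbell write for more complex hardware.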
Example applications
• Testing is hard as applications have to be re-written for GPU-centric communication
• Set of 4 workloads implemented:
[Chart: performance normalized to MPI (0.0-3.0, higher is better) for nbody_small, nbody_large, sum_small, sum_large, himeno and randomAccess on 2, 4, 6, 8, 10 and 12 nodes; series: GGAS, RMA]
Performance comparison - execution time
Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.
• 2-12 nodes (each with 2x Intel Ivy Bridge, Nvidia K20, FPGA network)
• Normalized to MPI: >1 = better performance, <1 = worse performance
Performance comparison - energy consumption
Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.
[Chart: energy consumption normalized to MPI (0.0-1.2, lower is better) for NB-S, NB-L, sum-S, sum-L, RA, Himeno and the average; series: GGAS, RMA]
• 12 nodes (each with 2x Intel Ivy Bridge, Nvidia K20, Extoll FPGA)
• Normalized to MPI: <1 = better energy consumption, >1 = worse energy consumption
Performance comparison - observations
Observations (each observed for a subset of the four workloads: N-Body, Himeno stencil, global sum, random access):
• RMA and MPI offer a better exploitation of overlap possibilities
• GGAS performs outstandingly for small payloads, as no indirections like context switches to the CPU or work request issues to the NIC are required
• The PCIe peer-to-peer read problem results in MPI performing better than RMA or GGAS for large payload sizes
• GGAS in combination with RMA outperforms MPI substantially (without the PCIe peer-to-peer read limitation)
• In practice, the execution time has an essential influence on energy consumption
• Accesses to host memory contribute significantly to DRAM and CPU socket power
• Bypassing a component like a CPU can save enough power to compensate for a longer execution time, resulting in energy savings
• Staging copies contribute significantly to both CPU and GPU power, due to the involved software stacks resp. active DMA controllers
• For irregular communication patterns or small payloads, GGAS saves both time and energy
Related effort: Simplified multi-GPU programming
• Single-GPU programming based on CUDA/OpenCL is a prime example of a BSP execution model
  • Exposes large amounts of structured parallelism, rather easy to use
• Multi-GPU programming is becoming more important, but adds huge amounts of complexity
  • Processor aggregation is easy
  • Memory aggregation is challenging
    • Local/remote bandwidth disparity
    • UVA (Unified Virtual Addressing) works out of the box -> poor performance
• Solution: use a compiler-based approach to automatically partition regular GPU code
Towards simplified multi-GPU programming
GPU Mekong
• Tool stack based on the LLVM tool chain
  • Code analysis
  • Automated decision making
  • Code transformations for automated partitioning and data movements (an index-modification sketch follows below)
[Diagram labels: regularity; input/output data, dimensionality; addition of super-block ID; index modification; executed kernels; iterative execution?; multi-device initialization; data distribution]
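A hedged before/after sketch of the kind of index modification such a partitioning pass could emit: the original kernel is rewritten to take a per-GPU super-block offset, so each device computes only its slice of the global grid. Names and the exact transformation are illustrative, not Mekong's actual output.

// Original single-GPU kernel: the global row index comes directly from blockIdx.
__global__ void scale_rows(float *a, int n, float s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) a[row] *= s;
}

// Partitioned variant: a super-block offset shifts each GPU into its own slice.
__global__ void scale_rows_part(float *a_local, int n, float s,
                                int superblock_offset)
{
    int row = (superblock_offset + blockIdx.x) * blockDim.x + threadIdx.x;
    if (row < n) a_local[row] *= s;   // a_local assumed to map this GPU's partition
}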
GPU Mekong
• Funded by a Google Research Award
• Under heavy development
• https://sites.google.com/site/gpumekong
• Initial results for a 16-GPU matrix multiply
Alexander Matz, Mark Hummel, Holger Fröning, Exploring LLVM Infrastructure for Simplified Multi-GPU Programming, Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), in conjunction with HiPEAC 2016, Prague, Czech Republic, Jan. 18, 2016.
[Charts: speedup over the number of GPUs (2-8 and 4-16) for NxN matrix multiply, N = 4096, 8192, 12288, 16384, 20480, 24576]
Wrapping up
Summary
• GPUs as first-class citizens in a peer networking environment, capable of sourcing and sinking traffic
  • Traditional messaging libraries and models poorly match the GPU’s thread-collaborative execution model
  • GPUs also require fast communication paths with minimal latency/maximum message rate, combined with high-overlap paths like Put/Get
• However: the semantic gap between architecture and user is growing
  • We need automated tooling to close this gap
• Unified communication models
  • Automatically selecting the right communication method and path between heterogeneous computing units => flexibility in scheduling kernels
  • Multiple communication models will dramatically push complexity, too
• CACM 03/2015, John Hennessy: it’s the era of software!
Specialized processors like GPUs require specialized communication models/methods
Thank you!
Credits
Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Daniel Schlegel (graduate student), Günther Schindler (graduate student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Larry Dennison (Nvidia), Hans Eberle (Nvidia)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google
Current main interactions