The Future of MPI

A Presentation at HPC Advisory Council Workshop, Malaga 2012

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
High-End Computing (HEC): PetaFlop to ExaFlop

[Figure: projected performance growth: 100 PFlops expected in 2015 and 1 EFlops in 2019]

An ExaFlop system is expected in 2019-2021!
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%)
Jun. 1997: 1/500 (0.2%)
Nov. 1997: 1/500 (0.2%)
Jun. 1998: 1/500 (0.2%)
Nov. 1998: 2/500 (0.4%)
Jun. 1999: 6/500 (1.2%)
Nov. 1999: 7/500 (1.4%)
Jun. 2000: 11/500 (2.2%)
Nov. 2000: 28/500 (5.6%)
Jun. 2001: 33/500 (6.6%)
Nov. 2001: 43/500 (8.6%)
Jun. 2002: 80/500 (16%)
Nov. 2002: 93/500 (18.6%)
Jun. 2003: 149/500 (29.8%)
Nov. 2003: 208/500 (41.6%)
Jun. 2004: 291/500 (58.2%)
Nov. 2004: 294/500 (58.8%)
Jun. 2005: 304/500 (60.8%)
Nov. 2005: 360/500 (72.0%)
Jun. 2006: 364/500 (72.8%)
Nov. 2006: 361/500 (72.2%)
Jun. 2007: 373/500 (74.6%)
Nov. 2007: 406/500 (81.2%)
Jun. 2008: 400/500 (80.0%)
Nov. 2008: 410/500 (82.0%)
Jun. 2009: 410/500 (82.0%)
Nov. 2009: 417/500 (83.4%)
Jun. 2010: 424/500 (84.8%)
Nov. 2010: 415/500 (83%)
Jun. 2011: 411/500 (82.2%)
Nov. 2011: 410/500 (82.0%)
Jun. 2012: 407/500 (81.4%)
Large-scale InfiniBand Installations

• 209 IB clusters (41.8%) in the June 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (13 systems):

147,456 cores (SuperMUC) in Germany (4th)
77,184 cores (Curie thin nodes) at France/CEA (9th)
120,640 cores (Nebulae) at China/NSCS (10th)
125,980 cores (Pleiades) at NASA/Ames (11th)
70,560 cores (Helios) at Japan/IFERC (12th)
73,278 cores (Tsubame 2.0) at Japan/GSIC (14th)
138,368 cores (Tera-100) at France/CEA (17th)
122,400 cores (Roadrunner) at LANL (19th)
78,660 cores (Lomonosov) in Russia (22nd)
137,200 cores (Sunway Blue Light) in China (26th)
46,208 cores (Zin) at LLNL (27th)
29,440 cores (Mole-8.5) in China (37th)
42,440 cores (Red Sky) at Sandia (39th)
More are getting installed!
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
Parallel Programming Models Overview

[Diagram: three abstract machine models: the Shared Memory Model (SHMEM, DSM), where processes P1-P3 share one memory; the Distributed Memory Model (MPI: Message Passing Interface), where each process has its own memory; and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …), which presents a logical shared memory over distributed memories]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation series, we concentrate on MPI first and
then focus on PGAS
MPI Overview and History

• Message passing library standardized by the MPI Forum
  – C, C++ and Fortran
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1997
  – MPI-3: ongoing effort (2008 – current)
Major MPI Features

• Point-to-point two-sided communication
• Collective communication
• One-sided communication
• Job startup
• Parallel I/O
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: software stack: Applications/Libraries on top of Programming Models (Message Passing Interface (MPI), Sockets and PGAS (UPC, OpenSHMEM, Global Arrays)); beneath them a Library or Runtime for Programming Models providing point-to-point communication, QoS, collective communication, synchronization & locks, I/O & file systems, and fault tolerance; all running over Networking Technologies (InfiniBand, 1/10/40GigE, RoCE & intelligent NICs) and Commodity Computing System Architectures (single, dual, quad, .. sockets; multi/many-core architectures and accelerators)]
Exascale System Targets

Systems                     2010       2018                                               Difference (2010 vs. 2018)
System peak                 2 PFlop/s  1 EFlop/s                                          O(1,000)
Power                       6 MW       ~20 MW (goal)
System memory               0.3 PB     32-64 PB                                           O(100)
Node performance            125 GF     1.2 or 15 TF                                       O(10) - O(100)
Node memory BW              25 GB/s    2-4 TB/s                                           O(100)
Node concurrency            12         O(1k) or O(10k)                                    O(100) - O(1,000)
Total node interconnect BW  3.5 GB/s   200-400 GB/s (1:4 or 1:8 from memory BW)           O(100)
System size (nodes)         18,700     O(100,000) or O(1M)                                O(10) - O(100)
Total concurrency           225,000    O(billion) + [O(10) to O(100) for latency hiding]  O(10,000)
Storage capacity            15 PB      500-1000 PB (>10x system memory is min)            O(10) - O(100)
IO rates                    0.2 TB/s   60 TB/s                                            O(100)
MTTI                        Days       O(1 day)                                           -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra
Basic Design Challenges for Exascale Systems

• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
  – Hard to meet the power requirements of data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rates
• Concurrency and Locality Challenge
  – Management of very large amounts of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults
How does MPI plan to meet these challenges?

• Power required for data movement operations is one of the main challenges
• Non-blocking collectives
  – Overlap computation and communication
• Much-improved one-sided interface
  – Reduce synchronization between sender and receiver
• Manage concurrency
  – Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM)
• Resiliency
  – New interface for detecting failures
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
Major New Features in MPI-3

• Major features
  – Non-blocking Collectives
  – Improved One-Sided (RMA) Model
  – MPI Tools Interface
• Draft version available: http://meetings.mpi-forum.org/MPI_3.0_main_page.php
• Targeted to be released during the next few weeks
Non-blocking Collective Operations

• Enable overlap of computation with communication
• Remove the synchronization effects of collective operations (with the exception of barrier)
• Completion when the local part of the operation is complete
• Completion of one non-blocking collective does not imply completion of other non-blocking collectives
• No "tag" for the collective operation
• Issuing many non-blocking collectives may exhaust resources
  – Quality implementations will ensure that this happens only in pathological cases
Non-blocking Collective Operations (cont'd)

• Non-blocking calls do not match blocking collective calls
  – The MPI implementation may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• Users must call collectives in the same order on all ranks
• Progress rules are the same as those for point-to-point operations
• Example new calls: MPIX_Ibarrier, MPIX_Iallreduce, …
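To make the overlap concrete, here is a minimal sketch of a non-blocking reduction. It assumes an MPI-3-capable library; draft implementations exposed these calls under the MPIX_ prefix listed above, while the code below uses the final MPI_ names. The loop standing in for independent computation is illustrative only.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction; the call returns immediately. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Independent computation, overlapped with the collective. */
    double work = 0.0;
    for (int i = 1; i <= 1000000; i++)
        work += 1.0 / i;

    /* The result in 'global' is valid only after completion. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum = %.1f (local work = %.4f)\n", global, work);
    MPI_Finalize();
    return 0;
}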
Recap of One-sided Communication

• Easy to express irregular patterns of communication
  – Easier than a request-response pattern using two-sided communication
• Decouples data transfer from synchronization
• Better overlap of communication with computation

[Diagram: four ranks (Rank 0 to Rank 3), each exposing part of its local memory in a common access "window" for Remote Memory Access]
Improved One-sided Model

• Remote Memory Access (RMA)
• The new proposal has major improvements
• MPI-2: public and private windows
  – Synchronization of windows is explicit
• MPI-2: works for non-cache-coherent systems
• MPI-3: two types of windows
  – Unified and Separate
  – The Unified window leverages hardware cache coherence

[Diagram: separate window model: a processor's private window, updated by local memory operations, is synchronized explicitly with the public window, which receives incoming RMA operations]
Highlights of New MPI-3 RMA Calls

• MPI_Win_create_dynamic
  – Window without memory attached
  – MPI_Win_attach to attach memory to a window
• MPI_Win_allocate_shared
  – Windows with shared memory
  – Allows direct load/store accesses by remote processes
• MPI_Rput, MPI_Rget, MPI_Raccumulate
  – Local completion by using MPI_Wait on request objects
• MPI_Get_accumulate, MPI_Fetch_and_op
  – Accumulate into target memory, return old data to origin
• MPI_Compare_and_swap
  – Atomic compare-and-swap
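As an illustration of the new window types, the sketch below allocates a shared-memory window and reads a neighbor's slot with a plain load. It assumes all ranks run on one node; in a real program one would first split the communicator with MPI_Comm_split_type(…, MPI_COMM_TYPE_SHARED, …). The sync/barrier/sync pattern is one common way to order the direct stores and loads.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One int per process, all slots in one shared segment. */
    int *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    *mine = rank * 10;            /* direct store into my slot */
    MPI_Win_sync(win);            /* make the store visible */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_sync(win);

    /* Query the base address of the right neighbor's slot and
       read it with a direct load -- no Put/Get needed. */
    int *nbr;
    MPI_Aint sz;
    int disp;
    MPI_Win_shared_query(win, (rank + 1) % size, &sz, &disp, &nbr);
    printf("rank %d sees neighbor value %d\n", rank, *nbr);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}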
Highlights of New MPI-3 RMA Calls (cont'd)

• MPI_Win_lock_all
  – Faster way to lock all members of the window
• MPI_Win_flush / MPI_Win_flush_all
  – Flush all RMA ops to a target / to the whole window
• MPI_Win_flush_local / MPI_Win_flush_local_all
  – Locally complete RMA ops to a target / to the whole window
• MPI_Win_sync
  – Synchronize public and private copies of the window
• Overlapping accesses were "erroneous" in MPI-2
  – They are "undefined" in MPI-3
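A minimal sketch of the passive-target style these calls enable: one epoch around many operations, with a per-target flush instead of a lock/unlock pair per operation. The window setup with MPI_Win_allocate and the ring pattern are illustrative assumptions (a unified window model is assumed for the final local read).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Single epoch covering all targets. */
    MPI_Win_lock_all(0, win);

    int target = (rank + 1) % size;
    double val = (double)rank;
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    /* Remotely complete just this target's operations; the epoch
       stays open for further RMA to any target. */
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d received %.1f\n", rank, *buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}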
MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPIT) to access MPI-internal information
  • Configuration and control information
    • Eager limit, buffer sizes, …
  • Performance information
    • Time spent in blocking calls, memory usage, …
  • Debugging information
    • Packet counters, thresholds, …
• External tools can build on top of this standard interface
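The draft "MPIT" naming above became the MPI_T routine prefix in the final standard. A small sketch that enumerates whatever control variables an implementation exposes (eager limits, buffer sizes, and so on) could look like this; which variables appear is entirely implementation-defined.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    /* How many control variables does this MPI expose? */
    MPI_T_cvar_get_num(&ncvars);

    for (int i = 0; i < ncvars; i++) {
        char name[256], desc[256];
        int nlen = sizeof(name), dlen = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &nlen, &verbosity, &dtype,
                            &enumtype, desc, &dlen, &binding, &scope);
        printf("%s : %s\n", name, desc);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}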
Miscellaneous Features

• New Fortran bindings
• Non-blocking file I/O
• C++ bindings removed from the standard
• Support for large data counts
• Scalable sparse collectives on process topologies
• Distributed graphs
• Topology-aware communicator creation
• Non-collective (group-collective) communicator creation
• Enhanced datatype support: MPI_TYPE_CREATE_HINDEXED_BLOCK
• Mprobe for hybrid programming on clusters of SMP nodes (see the sketch below)
• Numerous fixes in the existing proposal
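The matched probe added for hybrid programs lets a thread probe and then receive the same message atomically, so another thread cannot "steal" it between the probe and the receive. A minimal two-rank sketch using the MPI-3 names:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload[4] = { 1, 2, 3, 4 };
        MPI_Send(payload, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Mprobe returns a message handle bound to one message. */
        MPI_Message msg;
        MPI_Status st;
        int count;
        MPI_Mprobe(0, 0, MPI_COMM_WORLD, &msg, &st);
        MPI_Get_count(&st, MPI_INT, &count);
        int *buf = malloc(count * sizeof(int));
        /* Mrecv consumes exactly the probed message. */
        MPI_Mrecv(buf, count, MPI_INT, &msg, &st);
        printf("received %d ints\n", count);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}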
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 1,960 organizations (HPC centers, industry and universities) in 70 countries
  – More than 129,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 11th-ranked 125,980-core cluster (Pleiades) at NASA
    • 14th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 40th-ranked 62,976-core cluster (Ranger) at TACC
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system
Challenges being Addressed by MVAPICH2 for Exascale

• Scalability for million to billion processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + OpenSHMEM, MPI + UPC, …)
• Balancing intra-node and inter-node communication for next-generation multi-cores (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and accelerators (Intel MIC)
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• MPI-3 support will be available soon
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for the MPI-3 RMA model
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
One-way Latency: MPI over IB

[Figure: small-message and large-message one-way latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR; small-message latencies range from 0.99 us to 1.82 us across the five configurations]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Bandwidth: MPI over IB

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same five configurations; peak unidirectional bandwidths of 1706, 1917, 3280, 3385 and 6343 MBytes/sec, and peak bidirectional bandwidths of 3341, 3704, 4407, 6521 and 11643 MBytes/sec]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Hybrid Transport Design (UD/RC/XRC)

• Both UD and RC/XRC have benefits
• Evaluate the characteristics of all of them and use two sets of transports in the same application – get the best of both
• Available in MVAPICH (since 1.1) as a separate interface
• Available since MVAPICH2 1.7 as an integrated interface

M. Koop, T. Jones and D. K. Panda, "MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand," IPDPS '08
HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)

• Experimental platform: LLNL Hyperion cluster (Intel Harpertown + QDR IB)
• RC: default parameters
• UD: MV2_USE_UD=1
• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1

[Figure: HPCC Random Ring latency (us) for 128, 256, 512 and 1024 processes with the UD, Hybrid and RC transports; annotated improvements of 38%, 26%, 40% and 30% for the Hybrid mode]
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support)

[Figure: latency, bandwidth and bi-directional bandwidth vs. message size (bytes) for intra-socket and inter-socket communication, with shared memory (Shmem) and kernel-based zero-copy (LiMIC), using the latest MVAPICH2 1.8 on Intel Westmere; latencies of 0.19 us (intra-socket) and 0.45 us (inter-socket), peak bandwidth of about 10,000 MB/s and bi-directional bandwidth of about 18,500 MB/s]
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• MPI-3 RMA support
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Shared-memory Aware Collectives

[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) at 4,096 cores, original vs. shared-memory designs; and Barrier latency (us) for 128-1024 processes, point-to-point vs. shared-memory barrier]

• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
• Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1, MV2_USE_SHMEM_BARRIER=0/1
Hardware Multicast-aware MPI_Bcast

[Figure: 16-byte and 64 KB MPI_Bcast time (us) for 32-1024 processes, and small-message (1 B-1 KB) and large-message (2 KB-512 KB) MPI_Bcast time at 1,024 cores, default vs. hardware-multicast-based designs]

Available in MVAPICH2 1.9
Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs

[Diagram: spine and line-card switches connecting racks of nodes with busy cores, idle cores and a newly arriving job]
Non-contiguous Allocation of Jobs (cont'd)

• Individual processes belonging to one job can get scattered across the system
• The primary responsibility of the scheduler is to keep system throughput high

[Diagram: the new job's processes placed on idle cores scattered across racks]
Impact of Network-Topology-Aware Algorithms on Broadcast Performance

[Figure: broadcast latency (ms) vs. message size (128K-1M bytes) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT; and normalized latency vs. job size (128-1024 processes)]

• Impact of topology-aware schemes is more pronounced as
  • message size increases
  • system size scales
• Up to 14% improvement in performance at scale

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes, and default vs. topology-aware placement for a 2048-core run (15% improvement)]

• Reduced network topology discovery time from O(Nhosts^2) to O(Nhosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload

[Figure: normalized performance of HPL-Offload, HPL-1ring and HPL-Host vs. HPL problem size (N) as % of total memory, plus application results for P3DFFT and a pre-conjugate-gradient solver]

• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified pre-conjugate-gradient solver with Offload-Allreduce performs up to 21.8% better than the default version

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Non-blocking Neighborhood Collectives to Minimize Communication Overheads of Irregular Graph Algorithms

• Critical to improve the communication overheads of irregular graph algorithms to cater to the graph-theoretic demands of emerging "big-data" applications
• What are the challenges involved in re-designing irregular graph algorithms, such as 2D BFS, to achieve fine-grained computation/communication overlap?
• Can we design network-offload-based non-blocking neighborhood collectives in a high-performance, scalable manner?

[Figure: CombBLAS (2D BFS) application run-time on Hyperion (Scale=29), and communication-overhead breakdown with 1936 processes on Hyperion (annotated: 70%, 20% and 6.5%)]

K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload-based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, accepted at Cluster (IWPAPS) 2012
Power and Energy Savings with Power-Aware Collectives

[Figure: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings (annotated: 5% and 32%)]

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Sample Code – Without MPI Integration

[Diagram: data path from GPU through CPU host memory to the NIC and switch over PCIe]

At Sender:
  cudaMemcpy(sbuf, sdev, . . .);
  MPI_Send(sbuf, size, . . .);

At Receiver:
  MPI_Recv(rbuf, size, . . .);
  cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA
• High productivity but poor performance
Sample Code – User-Optimized Code

[Diagram: pipelined data path from GPU through CPU host memory to the NIC and switch over PCIe]

At Sender (and repeated at the receiver side):
  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, blksz, . . .);
  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(. . .);
          if (j > 0) MPI_Test(. . .);
      }
      MPI_Isend(sbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall(. . .);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• User-level copying may not match the internal MPI design
• High performance but poor productivity
Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside the MPI library, without user tuning
• A new design incorporated in MVAPICH2 to support this functionality
Sample Code – MVAPICH2-GPU

At Sender:
  MPI_Send(s_device, size, . . .);

At Receiver:
  MPI_Recv(r_device, size, . . .);

(Pipelining is handled inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (CUDA >= 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
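Put together as a complete example, a run with a CUDA-aware MVAPICH2 could look like the sketch below: device pointers are passed straight to MPI, which detects them through UVA and pipelines the staging internally. The buffer size and the two-rank pattern are illustrative assumptions.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *dev;
    cudaMalloc((void **)&dev, n * sizeof(double));

    if (rank == 0) {
        /* Device pointer goes directly into MPI_Send; the library
           handles the device-host staging (or IPC/RDMA paths). */
        MPI_Send(dev, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into device memory\n", n);
    }

    cudaFree(dev);
    MPI_Finalize();
    return 0;
}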
MPI-Level Two-sided Communication

[Figure: GPU-to-GPU transfer time (us) vs. message size (32 KB-4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11
Other MPI Operations and Optimizations for GPU Buffers

• Overlap optimizations for
  – One-sided communication
  – Collectives
  – Communication with datatypes
• Optimized designs for multiple GPUs per node
  – Use CUDA IPC (in CUDA 4.1) to avoid copies through host memory

• H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
• A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
• S. Potluri et al., Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems (ASHES), held in conjunction with IPDPS 2012, May 2012
MVAPICH2 1.8 and 1.9a Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Applications-Level Evaluation

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  • Lattice Boltzmann Method for multiphase flows with large density ratios
  • One process/GPU per node, 16 nodes
• AWP-ODC (courtesy: Yifeng Cui, SDSC)
  • A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  • 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070s, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Figure: LBM-CUDA step time (s) for domain sizes 256x256x256 through 512x512x512, MPI vs. MPI-GPU, with 13.7%, 12.0%, 11.8% and 9.4% improvements; AWP-ODC total execution time (s) with 1 and 2 GPUs/processes per node, MPI vs. MPI-GPU, with 11.1% and 7.9% improvements]
Many Integrated Core (MIC) Architecture

• Intel's Many Integrated Core (MIC) architecture is geared for HPC
• Critical part of their solution for exascale computing
• Knights Ferry (KNF) and Knights Corner (KNC)
• Many low-power x86 processor cores with hardware threads and wider vector units
• Applications and libraries can run with minor or no modifications
• Using MVAPICH2 for communication within a KNF coprocessor
  – out-of-the-box (Default)
  – shared-memory designs tuned for KNF (Optimized V1)
  – designs enhanced using a lower-level API (Optimized V2)
Point-to-Point Communication (Intra-MIC)

[Figure: normalized latency and normalized bandwidth vs. message size (bytes) for the Default, Optimized V1 and Optimized V2 designs within a KNF coprocessor; 13-17% latency improvement for small messages and 70-80% for large messages; 15% bandwidth improvement for small messages and up to 4X-9.5X for large messages]

Knights Ferry D0 1.20 GHz card, 32 cores; Alpha 9 MIC software stack with an additional pre-alpha patch
Collective Communication (Intra-MIC)

[Figure: normalized latency vs. message size (bytes) for Broadcast (small and large messages) and Scatter with the Default, Optimized V1 and Optimized V2 designs; improvements of 17% and 20-26% for Broadcast and 72-80% for Scatter]

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 – Best Student Paper Award
MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Support for MPI-3 RMA Model

MPI-3 one-sided communication at a glance:
• Window creation: Win_allocate; Win_allocate_shared; Win_create_dynamic with Win_attach and Win_detach
• Window models: Separate and Unified
• Synchronization: Lock_all and Unlock_all; Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all; Win_sync
• Communication: Get_accumulate, Fetch_and_op, Compare_and_swap; Rput, Rget, Raccumulate, Rget_accumulate
• Accumulate ordering; conflicting accesses are undefined
Support for MPI-3 RMA Model (cont'd)

• Flush operations
  – Local and remote completions are bundled in MPI-2
  – Considerable overhead on networks like IB, where the semantics and cost of local and remote completions differ
  – Flush operations allow a more efficient check for completions
• Request-based operations
  – Current semantics provide bulk synchronization
  – Request-based operations return an MPI request that can be polled individually
  – Allow for better overlap with the direct one-sided design in MVAPICH2
• Dynamic windows
  – Allow users to attach and detach memory dynamically; the key is to hide the overheads (such as the RDMA key exchange)
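A sketch of the dynamic-window usage described above, using the MPI-3 names; publishing the attached addresses via MPI_Get_address and MPI_Allgather is one common pattern, since dynamic windows address remote memory by absolute address.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Create a window with no memory attached. */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Attach a local buffer later, on demand. */
    double *buf = malloc(sizeof(double));
    *buf = (double)rank;
    MPI_Win_attach(win, buf, sizeof(double));

    /* Every rank publishes the absolute address of its buffer. */
    MPI_Aint addr, *addrs = malloc(size * sizeof(MPI_Aint));
    MPI_Get_address(buf, &addr);
    MPI_Allgather(&addr, 1, MPI_AINT, addrs, 1, MPI_AINT,
                  MPI_COMM_WORLD);

    int target = (rank + 1) % size;
    double val;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&val, 1, MPI_DOUBLE, target, addrs[target], 1,
            MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);
    printf("rank %d read %.1f\n", rank, val);

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);
    free(addrs);
    MPI_Finalize();
    return 0;
}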
[Figure: time (usec) vs. message size (1 B-4 KB) for Put+Unlock, Put+Flush_local and Put+Flush; and percentage overlap vs. message size (2 KB-2 MB) for Lock-Unlock vs. request-based operations]

• Flush_local allows a faster check on local completions
• Request-based operations provide much better overlap than Lock-Unlock in a Get-Compute-Put communication pattern

S. Potluri, S. Sur, D. Bureddy and D. K. Panda, Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand, EuroMPI 2011, September 2011
MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Different Approaches for Supporting PGAS Models

• Library-based
  – Global Arrays
  – OpenSHMEM
• Compiler-based
  – Unified Parallel C (UPC)
  – Co-Array Fortran (CAF)
• HPCS language-based
  – X10
  – Chapel
  – Fortress
Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) Designs

• Based on the OpenSHMEM reference implementation (http://openshmem.org/)
  • Provides a design over GASNet
  • Does not take advantage of all OFED features
• Design a scalable, high-performance OpenSHMEM over OFED
• Designing a hybrid MPI + OpenSHMEM model
  • Current model: separate runtimes for OpenSHMEM and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  • Our approach: a single runtime for MPI and OpenSHMEM

[Diagram: a hybrid (OpenSHMEM+MPI) application issuing OpenSHMEM calls and MPI calls through the OpenSHMEM and MPI interfaces of a single OSU design over the InfiniBand network]

Available in MVAPICH2-X
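A toy sketch of the hybrid style this enables: OpenSHMEM one-sided accesses mixed with MPI collectives in one program over a single runtime. The initialization order shown (start_pes before MPI calls) is an assumption here; consult the MVAPICH2-X documentation for the supported sequence.

#include <shmem.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    static int src;               /* symmetric data object */
    start_pes(0);                 /* OpenSHMEM initialization */
    MPI_Init(&argc, &argv);       /* assumption: the unified runtime
                                     accepts both initializations */

    int pe = _my_pe(), npes = _num_pes();
    src = pe;
    shmem_barrier_all();

    /* One-sided get from the right neighbor via OpenSHMEM. */
    int val;
    shmem_int_get(&val, &src, 1, (pe + 1) % npes);

    /* Global reduction via MPI on the same processes. */
    int sum;
    MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (pe == 0) printf("sum of neighbor ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}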
Micro-Benchmark Performance (OpenSHMEM)

[Figure: OpenSHMEM-GASNet vs. OpenSHMEM-OSU: atomic operation latency (us) for fadd, finc, cswap, swap, add and inc, and latency (ms) of Broadcast, Collect and Reduce with 256-byte messages at 32-256 processes; annotated improvements of 41%, 16%, 35% and 36% for the OSU design]
Performance of Hybrid (OpenSHMEM+MPI) Applications

[Figure: execution time (s) vs. number of processes for 2D Heat Transfer Modeling (32-512 processes) and Graph-500 (32-256 processes), plus network resource utilization (MB), Hybrid-GASNet vs. Hybrid-OSU]

• Improved performance for hybrid applications
  • 34% improvement for 2D Heat Transfer Modeling with 512 processes
  • 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Evaluation of Hybrid MPI+UPC NAS-FT

[Figure: NAS FT execution time (sec) for classes B and C at 64 and 128 processes, comparing GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid version]

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• 34% improvement for FT (Class C, 128 processes)

Hybrid MPI + UPC support will be available in a future MVAPICH2-X release.
MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
  – Extending the GPGPU support (GPU-Direct RDMA)
  – Support for Intel MIC
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-restart and migration support with in-memory checkpointing
• QoS-aware I/O and checkpointing
Conclusions

• Presented challenges in designing programming models for Exascale systems
• Overview of MPI and MPI-3 features
• How MVAPICH2 is addressing some of these challenges
• MPI-3 plans for MVAPICH2
• MVAPICH2 plans for Exascale systems
Funding Acknowledgments

Funding Support by: [sponsor logos]
Equipment Support by: [vendor logos]
Personnel Acknowledgments

Current Students: J. Chen (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), M. Li (Ph.D.), M. Luo (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), H. Subramoni (Ph.D.), A. Venkatesh (Ph.D.)

Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur
Current Post-Docs: X. Lu, J. Vienne, H. Wang
Past Post-Docs: X. Besseron, H.-W. Jin, E. Mancini, S. Marcarelli
Current Programmers: M. Arnold, D. Bureddy, J. Perkins
Web Pointers

http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu