Programming Models and their Designs for Exascale Systems
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Stanford Conference (2013)
Current and Next Generation HPC Systems and Applications
• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PFlops expected in 2015
• 1 EFlops expected in 2018
• Expected to have an ExaFlop system in 2019-2022!
Trends for Commodity Computing Clusters in the Top 500 Supercomputer List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top 500 list over time]
Large-scale InfiniBand Installations
• 224 IB Clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (16 systems):
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
  – 122,400 cores (Roadrunner) at LANL (22nd)
  – 53,504 cores (PRIMERGY) at Australia/NCI (24th)
  – 78,660 cores (Lomonosov) in Russia (26th)
  – 137,200 cores (Sunway Blue Light) in China (28th)
  – 46,208 cores (Zin) at LLNL (29th)
  – 33,664 cores (MareNostrum) at Spain/BSC (36th)
  – 32,256 cores (SGI Altix X) at Japan/CRIEPI (39th)
• More are getting installed!
Parallel Programming Models Overview
• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 access one shared memory
• Distributed Memory Model (MPI – Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) Model (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories presented as a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation, we concentrate on MPI first and then focus on PGAS
MPI Overview and History
• Message Passing Library standardized by the MPI Forum
  – C and Fortran
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the “industry standard” for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008 – 2012, standardized before SC ‘12
  – Next plans for MPI 3.1, 3.2, …
Major MPI Features
• Point-to-point Two-sided Communication (a minimal example follows below)
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
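To make the two-sided model concrete, here is a minimal sketch (my illustration, not code from the talk) of point-to-point communication with standard MPI calls; rank 0 sends a small integer array to rank 1, and the message sizes and tag are arbitrary.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, data[4] = {0, 1, 2, 3};

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Two-sided: the sender posts a send ... */
          MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* ... and the receiver posts a matching receive. */
          MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Rank 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
      }

      MPI_Finalize();
      return 0;
  }

Run with at least two ranks, e.g. mpirun -np 2 ./a.out.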
Exascale System Targets
Systems: 2010 vs. 2018 target (expected difference between today and 2018)
• System peak: 2 PFlop/s vs. 1 EFlop/s, O(1,000)
• Power: 6 MW vs. ~20 MW (goal)
• System memory: 0.3 PB vs. 32 – 64 PB, O(100)
• Node performance: 125 GF vs. 1.2 or 15 TF, O(10) – O(100)
• Node memory BW: 25 GB/s vs. 2 – 4 TB/s, O(100)
• Node concurrency: 12 vs. O(1k) or O(10k), O(100) – O(1,000)
• Total node interconnect BW: 3.5 GB/s vs. 200 – 400 GB/s (1:4 or 1:8 from memory BW), O(100)
• System size (nodes): 18,700 vs. O(100,000) or O(1M), O(10) – O(100)
• Total concurrency: 225,000 vs. O(billion) + [O(10) to O(100) for latency hiding], O(10,000)
• Storage capacity: 15 PB vs. 500 – 1,000 PB (>10x system memory is minimum), O(10) – O(100)
• I/O rates: 0.2 TB/s vs. 60 TB/s, O(100)
• MTTI: days vs. O(1 day), a reduction of O(10)
Courtesy: DOE Exascale Study and Prof. Jack Dongarra
Basic Design Challenges for Exascale Systems
• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
  – Hard to meet the power requirements for data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of very large amounts of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults
How does MPI plan to meet these challenges?
• Power required for data movement operations is one of the main challenges
• Non-blocking collectives
  – Overlap computation and communication (see the sketch below)
• Much improved one-sided interface
  – Reduce synchronization between sender and receiver
• Manage concurrency
  – Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM)
• Resiliency
  – New interface for detecting failures
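As an illustration of the overlap idea (not from the slides), here is a hedged sketch of an MPI-3 non-blocking collective: the reduction is started, independent work proceeds, and the result is consumed only after completion. The routine names are standard MPI-3, while compute_independent_work() is a hypothetical placeholder for application work.

  #include <mpi.h>

  static void compute_independent_work(void)
  {
      /* placeholder for application work that does not need the reduction result */
  }

  void overlapped_sum(const double *local, double *global, int n, MPI_Comm comm)
  {
      MPI_Request req;

      /* Start the reduction without waiting for it to complete. */
      MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

      /* Do useful work that is independent of the reduction result. */
      compute_independent_work();

      /* Complete the collective when the result is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }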
Major New Features in MPI-3
• Non-blocking Collectives
• Improved One-Sided (RMA) Model (sketched below)
• MPI Tools Interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
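A minimal sketch (my illustration, assuming a generic MPI-3 installation) of the one-sided model: each rank exposes one integer in an RMA window and writes its rank id into its right neighbor within a fence epoch, with no receive posted on the target side.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, value;
      int *win_buf;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Allocate one integer per rank and expose it in an RMA window. */
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win_buf, &win);
      *win_buf = -1;
      value = rank;

      /* Fence-based epoch: every rank puts its rank id into its right neighbor. */
      MPI_Win_fence(0, win);
      MPI_Put(&value, 1, MPI_INT, (rank + 1) % nprocs, 0, 1, MPI_INT, win);
      MPI_Win_fence(0, win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }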
Different Approaches for Supporting PGAS Models
• Library-based
  – Global Arrays
  – OpenSHMEM
• Compiler-based
  – Unified Parallel C (UPC)
  – Co-Array Fortran (CAF)
• HPCS Language-based
  – X10
  – Chapel
  – Fortress
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Layered view, with co-design opportunities and challenges across the various layers:]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware: Communication Library or Runtime for Programming Models
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
• Networking Technologies (InfiniBand, 10/40GigE, Gemini, BlueGene)
• Multi/Many-core Architectures
• Accelerators (NVIDIA and MIC)
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable Collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and initial MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 150,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 204,900-core cluster (Stampede) at TACC
    • 14th ranked 125,980-core cluster (Pleiades) at NASA
    • 17th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede (9 PFlop) system
Challenges being Addressed by MVAPICH2 for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Support for GPGPUs and Co-Processors (Intel MIC)
• Scalable Collective communication
  – Multicore-aware and Hardware-multicast-based
  – Topology-aware
  – Offload and Non-blocking
  – Power-aware
• QoS support for communication and I/O
• Application Scalability
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, …)
One-way Latency: MPI over IB
[Charts: small- and large-message one-way latency for MVAPICH/MVAPICH2 over Qlogic DDR, Qlogic QDR, ConnectX DDR, ConnectX2-PCIe2 QDR, ConnectX3-PCIe3 FDR and Mellanox ConnectIB Dual-FDR; small-message latencies range from 0.99 us to 1.82 us across the six configurations]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch
FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
ConnectIB Dual-FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Bandwidth: MPI over IB
[Charts: unidirectional and bidirectional bandwidth for the same six configurations; peak unidirectional bandwidths of 1706, 1917, 3280, 3385, 6343 and 12485 MBytes/sec and peak bidirectional bandwidths of 3341, 3704, 4407, 6521, 11643 and 21025 MBytes/sec, with the highest values on the dual-port ConnectIB FDR configuration]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch
FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
ConnectIB Dual-FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Hybrid Transport Design (UD/RC/XRC)
• Both UD and RC/XRC have benefits
• Evaluate the characteristics of all of them and use two sets of transports in the same application to get the best of both
• Application transparent
• Available since MVAPICH2 1.7
M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ‘08
HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)
• Experimental Platform: LLNL Hyperion Cluster (Intel Harpertown + QDR IB)
• RC: default parameters
• UD: MV2_USE_UD=1
• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1
[Chart: random ring latency (us) vs. number of processes (128 - 1024) for UD, Hybrid and RC, with annotated improvements of 38%, 26%, 40% and 30% for the hybrid mode]
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support)
• Latest MVAPICH2 1.9a2 on Intel Westmere
[Charts: intra-node latency, bandwidth and bi-directional bandwidth for intra-socket and inter-socket communication, comparing the LiMIC kernel-based zero-copy and shared-memory (Shmem) designs; intra-socket latency of 0.19 us, inter-socket latency of 0.45 us, peak bandwidth of about 10,000 MB/s and bi-directional bandwidth of about 18,500 MB/s]
Challenges being Addressed by MVAPICH2 for Exascale (outline slide repeated; see the full list above)
Sample Code – Without MPI Integration
[Diagram: GPU and NIC attached to the CPU over PCIe; the NIC connects to the switch]
At Sender:
  cudaMemcpy(sbuf, sdev, . . .);
  MPI_Send(sbuf, size, . . .);
At Receiver:
  MPI_Recv(rbuf, size, . . .);
  cudaMemcpy(rdev, rbuf, . . .);
• Naïve implementation with standard MPI and CUDA
• High Productivity and Poor Performance
Sample Code – User-Optimized Code
[Diagram: GPU and NIC attached to the CPU over PCIe; the NIC connects to the switch]
At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(. . .);
      if (j > 0) MPI_Test(. . .);
    }
    MPI_Isend(sbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall(. . .);
• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code shown for the sender side (and repeated at the receiver side)
• User-level copying may not match the internal MPI design
• High Performance and Poor Productivity
Can this be done within the MPI Library?
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 to support this functionality
Sample Code – MVAPICH2-GPU
At Sender:
  MPI_Send(s_device, size, …);
At Receiver:
  MPI_Recv(r_device, size, …);
(pipelining is handled inside MVAPICH2)
• MVAPICH2-GPU: standard MPI interfaces used, directly on GPU device buffers
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High Performance and High Productivity (a fuller usage sketch follows below)
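To show the end-to-end usage pattern, the following is a hedged sketch (my illustration, not code from the slides) of sending a GPU-resident buffer through a CUDA-aware MPI library such as MVAPICH2-GPU; the buffer size and tag are arbitrary, and with MVAPICH2 the library's CUDA support typically has to be enabled at run time (MV2_USE_CUDA=1).

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank;
      const int n = 1 << 20;              /* 1M floats, an arbitrary size */
      float *d_buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Buffer lives in GPU device memory; no explicit host staging in user code. */
      cudaMalloc((void **)&d_buf, n * sizeof(float));

      if (rank == 0) {
          /* Device pointer passed directly to MPI; the library pipelines internally. */
          MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
  }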
MPI-Level Two-sided Communication
• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4MB messages
[Chart: transfer time (us) vs. message size (32K - 4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC ‘11
Other MPI Operations and Optimizations for GPU Buffers
• Overlap optimizations for
  – One-sided Communication
  – Collectives
  – Communication with Datatypes
• Optimized designs for multiple GPUs per node
  – Use CUDA IPC (in CUDA 4.1) to avoid copies through host memory
• A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
• H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
• S. Potluri et al., Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems (ASHES), held in conjunction with IPDPS 2012, May 2012
MVAPICH2 1.8 and 1.9 Series
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Applications-Level Evaluation (LBM)
• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – LBM with 1D and 3D decomposition respectively
  – 1D LBM-CUDA: one process/GPU per node, 16 nodes, 4 groups data grid
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
[Charts: 1D LBM-CUDA step time for domain sizes 256x256x256 through 512x512x512, where MPI-GPU improves over MPI by 9.4% to 13.7%; 3D LBM-CUDA total execution time on 8 to 64 GPUs, where MPI-GPU improves over MPI by 5.6% to 15.5%]
Applications-Level Evaluation (AWP-ODC)
• AWP-ODC (Courtesy: Yifeng Cui, SDSC)
  – A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  – 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory
[Chart: total execution time with 1 GPU/process per node and 2 GPUs/processes per node; MPI-GPU improves over MPI by 11.1% and 7.9% respectively]
• Working on upcoming GPU-Direct-RDMA support
Many Integrated Core (MIC) Architecture
• Intel’s Many Integrated Core (MIC) architecture is geared for HPC
• Critical part of Intel’s solution for exascale computing
• Knights Ferry (KNF) and Knights Corner (KNC)
• Many low-power x86 processor cores with hardware threads and wider vector units
• Applications and libraries can run with minor or no modifications
• Enhancing MVAPICH2 for MIC for optimized communication
• Evaluation on TACC Stampede
MVAPICH2-MIC on TACC Stampede: Intra-node MPI Latency
• MVAPICH2-MIC (based on MVAPICH2 1.9a2) vs. Intel MPI 4.1.0.024
• Intel Sandy Bridge (E5-2680) node with 16 cores, SE10P (B0-KNC), MPSS 4346-16 (Gold), Composer_xe_2013.1.117, and IB FDR MT4099 HCA
[Charts: ping-pong latency (us) vs. message size (2K - 2M bytes) for Intra-Host and Intra-MIC, and for Host-MIC, comparing MVAPICH2-MIC and Intel MPI]
MVAPICH2-MIC on TACC Stampede: Intra-node MPI Bandwidth
• MVAPICH2-MIC (based on MVAPICH2 1.9a2) vs. Intel MPI 4.1.0.024
• Intel Sandy Bridge (E5-2680) node with 16 cores, SE10P (B0-KNC), MPSS 4346-16 (Gold), Composer_xe_2013.1.117, and IB FDR MT4099 HCA
[Charts: uni-directional bandwidth (MB/s) vs. message size (1 byte - 1M bytes) for Intra-Host and Intra-MIC, and for Host-MIC and MIC-Host, comparing MVAPICH2-MIC and Intel MPI]
MVAPICH2-MIC on TACC Stampede: Inter-node MPI Performance
• MVAPICH2-MIC (based on MVAPICH2 1.9a2) vs. Intel MPI 4.1.0.024
• Intel Sandy Bridge (E5-2680) node with 16 cores, SE10P (B0-KNC), MPSS 4346-16 (Gold), Composer_xe_2013.1.117, and IB FDR MT4099 HCA
[Charts: inter-node uni-directional bandwidth (MB/s) and ping-pong latency (us) vs. message size for Host-Host and MIC-MIC, comparing MVAPICH2 and Intel MPI]
Challenges being Addressed by MVAPICH2 for Exascale (outline slide repeated; see the full list above)
Shared-memory Aware Collectives
• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
  – Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1
  [Charts: MPI_Reduce and MPI_Allreduce latency (us) vs. message size at 4096 cores, comparing the original and shared-memory-aware designs]
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
  – Runtime parameter: MV2_USE_SHMEM_BARRIER=0/1
  [Chart: barrier latency (us) vs. number of processes (128 - 1024), comparing the point-to-point and shared-memory barrier designs]
Hardware Multicast-aware MPI_Bcast on Stampede (Preliminary)
• ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
[Charts: MPI_Bcast latency (us) for small messages (2 - 512 bytes) and large messages (2K - 128K bytes) at 102,400 cores, and for 16-byte and 32-KByte messages across varying numbers of nodes, comparing the default and hardware-multicast-based designs]
Non-contiguous Allocation of Jobs
• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs
[Diagram: spine and line-card switches connecting nodes with busy cores, idle cores and a newly placed job]
• Individual processes belonging to one job can get scattered across the system
• The primary responsibility of the scheduler is to keep system throughput high
Impact of Network-Topology Aware Algorithms on Broadcast Performance
[Charts: broadcast latency (ms) vs. message size (128K - 1M bytes) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT schemes, and normalized latency vs. job size (128 - 1K processes)]
• Impact of topology-aware schemes is more pronounced as
  – message size increases
  – system size scales
• Up to 14% improvement in performance at scale
H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster ‘11
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
• Reduce network topology discovery time from O(N^2) to O(N) in the number of hosts
• 15% improvement in MILC application execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for a 2048-core run]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send)
• Sender creates a task-list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
  – A wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver
[Diagram: InfiniBand HCA with send/receive queues, send/receive completion queues, a management queue (MQ/MCQ) and a task list of Send and Wait WQEs, attached to the physical link]
Application Benefits with Non-Blocking Collectives based on ConnectX-2 Collective Offload
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
  [Chart: normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host vs. HPL problem size (N) as % of total memory]
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12
Non-blocking Neighborhood Collectives to Minimize Communication Overheads of Irregular Graph Algorithms
• Critical to improve the communication overheads of irregular graph algorithms to cater to the graph-theoretic demands of emerging “big-data” applications
• What are the challenges involved in re-designing irregular graph algorithms, such as 2D BFS, to achieve fine-grained computation/communication overlap?
• Can we design network-offload-based non-blocking neighborhood collectives in a high-performance, scalable manner? (a generic neighborhood-collective sketch follows below)
[Charts: CombBLAS (2D BFS) application run-time on Hyperion (Scale=29) and communication overheads in CombBLAS with 1936 processes on Hyperion]
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS workshop (Cluster 2012)
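For context, MPI-3 already defines neighborhood collectives over a process topology; the sketch below is my own hedged illustration (not the offload design from the paper) of creating a distributed-graph communicator and posting a non-blocking neighbor exchange that can overlap with local graph computation. The neighbor lists and the one-integer payload are placeholders.

  #include <mpi.h>
  #include <stdlib.h>

  /* Exchange one int with each graph neighbor while local work proceeds. */
  void neighbor_exchange(MPI_Comm comm, int nnbrs, const int *nbrs)
  {
      MPI_Comm graph_comm;
      MPI_Request req;
      int *sendbuf = malloc(nnbrs * sizeof(int));
      int *recvbuf = malloc(nnbrs * sizeof(int));

      for (int i = 0; i < nnbrs; i++)
          sendbuf[i] = i;                 /* placeholder payload */

      /* Describe the communication graph; same in/out neighbor lists here. */
      MPI_Dist_graph_create_adjacent(comm, nnbrs, nbrs, MPI_UNWEIGHTED,
                                     nnbrs, nbrs, MPI_UNWEIGHTED,
                                     MPI_INFO_NULL, 0, &graph_comm);

      /* Non-blocking neighborhood all-to-all: one int to/from each neighbor. */
      MPI_Ineighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                             graph_comm, &req);

      /* ... local (frontier) computation could overlap here ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);
      MPI_Comm_free(&graph_comm);
      free(sendbuf);
      free(recvbuf);
  }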
Power and Energy Savings with Power-Aware Collectives
• Performance and power comparison: MPI_Alltoall with 64 processes on 8 nodes
• CPMD application performance and power savings (annotated at 5% and 32% on the charts)
K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters”, ICPP ‘10
Challenges being Addressed by MVAPICH2 for Exascale (outline slide repeated; see the full list above)
Minimizing Network Contention with QoS-Aware Data-Staging
• Asynchronous I/O introduces contention for network resources
• How should data be orchestrated in a data-staging architecture to eliminate such contention?
• Can the QoS capabilities provided by cutting-edge interconnect technologies be leveraged by parallel filesystems to minimize network contention?
• Reduces runtime overhead from 17.9% to 8% for AWP (Anelastic Wave Propagation, 64 MPI processes) and from 32.8% to 9.31% for the NAS Parallel Benchmark Conjugate Gradient, Class D (64 MPI processes)
[Charts: MPI message latency and bandwidth under I/O noise, and normalized runtimes for AWP and NAS-CG with default I/O noise vs. isolated I/O noise]
R. Rajachandrasekar, J. Jaswani, H. Subramoni and D. K. Panda, Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework, IEEE Cluster, Sept. 2012
Application-Level Performance on Stampede using MVAPICH2 (Preliminary)
• Helium Plume simulation using the Uintah software framework (www.uintah.utah.edu)
• Weak scaling results
• Helium Plume is an incompressible turbulent flow problem
• Uses
  – Arches CFD component
  – Hypre linear solver
Courtesy: John Schmidt, Martin Berzins; University of Utah
Challenges being Addressed by MVAPICH2 for Exascale (outline slide repeated; see the full list above)
Partitioned Global Address Space (PGAS) Models
• PGAS Model
  – Shared memory abstraction over distributed systems
  – Better programmability
• OpenSHMEM (http://openshmem.org/)
  – Open specification to standardize the SHMEM (SHared MEMory) model; library-based (a small example follows below)
• Unified Parallel C (UPC)
  – Based on extensions to ISO C99; compiler-based
• Will applications be re-written entirely in the PGAS model?
  – Probably not; PGAS models are still emerging
  – Hybrid MPI+PGAS might be better
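To make the library-based PGAS style concrete, here is a hedged sketch (my illustration, not from the talk) in the OpenSHMEM 1.0-era API: every PE allocates a symmetric integer and writes its PE id into the next PE's copy with a one-sided put.

  #include <shmem.h>
  #include <stdio.h>

  int main(void)
  {
      start_pes(0);                        /* initialize OpenSHMEM (1.0-style) */

      int me   = _my_pe();
      int npes = _num_pes();

      /* Symmetric allocation: the same address is valid on every PE. */
      int *dest = (int *) shmalloc(sizeof(int));
      *dest = -1;

      shmem_barrier_all();

      /* One-sided put: write my PE id into the next PE's copy of dest. */
      shmem_int_put(dest, &me, 1, (me + 1) % npes);

      shmem_barrier_all();
      printf("PE %d received %d\n", me, *dest);

      shfree(dest);
      return 0;
  }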
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on their communication characteristics
• Benefits:
  – Best of the distributed-memory computing model
  – Best of the shared-memory computing model
• Exascale Roadmap*: “Hybrid Programming is a practical way to program exascale systems”
[Diagram: an HPC application in which some kernels (Kernel 1, Kernel 3, …) remain in MPI while others (Kernel 2, Kernel N) are re-written in a PGAS model]
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computing Applications, ISSN 1094-3420
Current Approaches for Hybrid Programming
• Layering one programming model over another
  – Poor performance due to semantics mismatch
• Separate runtime for each programming model
  – Needs more network resources
  – Might lead to deadlock!
  – Poor performance
[Diagram: a hybrid (PGAS+MPI) application issuing PGAS calls and MPI calls into separate PGAS and MPI runtimes, both sitting on the InfiniBand network]
Our Approach: Unified Communication Runtime
• Optimal network resource usage
• No deadlock because of the single runtime
• Better performance
[Diagram: in contrast to the separate-runtimes approach, a hybrid (OpenSHMEM+MPI) application issues PGAS calls and MPI calls through PGAS and MPI interfaces into one unified communication runtime on the InfiniBand network]
Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) Designs
• Existing design is based on the OpenSHMEM reference implementation (http://openshmem.org/)
  – Provides a design over GASNet
  – Does not take advantage of all OFED features
• Design a scalable and high-performance OpenSHMEM over OFED
• Designing a Hybrid MPI + OpenSHMEM model
  – Current model: separate runtimes for OpenSHMEM and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  – Our approach: a single runtime for MPI and OpenSHMEM (a hedged usage sketch follows below)
[Diagram: hybrid (OpenSHMEM+MPI) application with OpenSHMEM and MPI interfaces over the OSU unified design on the InfiniBand network]
• Available in MVAPICH2-X 1.9a2
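As a hedged illustration of what a hybrid program looks like to the user (my sketch, not code from the talk), MPI and OpenSHMEM calls can be mixed in one application; with a unified runtime such as MVAPICH2-X both sets of calls are serviced by a single communication library, although how the two models are initialized together is implementation-dependent.

  #include <mpi.h>
  #include <shmem.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      /* Initialize both models; with a unified runtime they share one library. */
      MPI_Init(&argc, &argv);
      start_pes(0);

      int me = _my_pe();                   /* PE id; with a unified runtime this typically matches the MPI rank */
      long *counter = (long *) shmalloc(sizeof(long));
      *counter = 0;
      shmem_barrier_all();

      /* PGAS-style phase: one-sided atomic increment on PE 0's counter. */
      shmem_long_inc(counter, 0);

      /* MPI-style phase: a collective over all processes. */
      long local = me, sum = 0;
      MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

      shmem_barrier_all();
      if (me == 0)
          printf("counter = %ld, rank sum = %ld\n", *counter, sum);

      shfree(counter);
      MPI_Finalize();
      return 0;
  }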
OpenSHMEM Put/Get Performance
• OSU OpenSHMEM micro-benchmarks (OMB v3.8)
• Better OpenSHMEM put performance with the OSU design
• OpenSHMEM memget performance is almost identical for the native OpenSHMEM-GASNet and the OpenSHMEM-OSU designs
[Charts: put and get latency (us) vs. message size (1 byte - 2K bytes) for OpenSHMEM-GASNet and OpenSHMEM-OSU]
Micro-Benchmark Performance (OpenSHMEM)
[Charts comparing OpenSHMEM-GASNet and OpenSHMEM-OSU: atomic operations (fadd, finc, cswap, swap, add, inc) latency, with annotated improvements of 41% and 16%; Broadcast (256 bytes) latency vs. number of processes (32 - 256), up to 41% improvement; Collect (256 bytes), up to 35% improvement; Reduce (256 bytes), up to 36% improvement]
Performance of Hybrid (OpenSHMEM+MPI) Applications
• Improved performance for hybrid applications
  – 34% improvement for 2D Heat Transfer Modeling with 512 processes
  – 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources
[Charts: 2D Heat Transfer Modeling execution time (32 - 512 processes), Graph-500 execution time (32 - 256 processes), and network resource usage (MB) for Hybrid-GASNet vs. Hybrid-OSU]
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Scalable UPC and Hybrid (MPI and UPC) Designs
• Based on Berkeley UPC version 2.14.2 (http://upc.lbl.gov/)
• Design a scalable and high-performance UPC over OFED
• Designing a Hybrid MPI + UPC model
  – Current model: separate runtimes for UPC and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  – Our approach: a single runtime for MPI and UPC
[Diagram: hybrid (MPI+UPC) application with GASNet and MPI interfaces over the OSU design on the InfiniBand network]
• Available in MVAPICH2-X 1.9a2
UPC Micro-benchmark Performance
• OSU UPC micro-benchmarks (OMB v3.8)
• Better UPC memput performance with the UPC-OSU conduit
• UPC memget performance is almost identical for the native UPC-GASNet and UPC-OSU conduits
[Charts: UPC memput and memget latency (us) vs. message size (1 byte - 16K bytes) for UPC-GASNet and UPC-OSU]
Evaluation using UPC NAS Benchmarks
• Evaluations using the UPC NAS Benchmarks v2.4 (Class C)
• UPC-OSU performs equal to or better than UPC-GASNet
  – 11% improvement for CG (256 processes)
  – 10% improvement for MG (256 processes)
[Charts: execution time (s) for CG, MG and FT at 64 - 256 processes with UPC-GASNet and UPC-OSU]
Evaluation of Hybrid MPI+UPC NAS-FT
• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• A truly hybrid program
• For FT (Class C, 128 processes)
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU
[Chart: execution time (s) for NAS problem size / system size combinations B-64, C-64, B-128 and C-128 with UPC-GASNet, UPC-OSU and Hybrid-OSU]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
  – Extending the GPGPU support (GPU-Direct RDMA)
  – Enhanced support for Intel MIC
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
Concluding Remarks
• InfiniBand with the RDMA feature is gaining momentum in HPC, with the best performance and growing usage
• As the HPC community moves to Exascale, new solutions are needed for designing and implementing programming models
• Demonstrated how such solutions can be designed with MVAPICH2 and MVAPICH2-X, and their performance benefits
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– M. Luo (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– M. Rahman (Ph.D.)
– H. Subramoni (Ph.D.)
– A. Venkatesh (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Current Post-Docs
– K. Hamidouche
– X. Lu
– H. Wang
Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins
Past Post-Docs
– X. Besseron
– H.-W. Jin
– E. Mancini
– S. Marcarelli
– J. Vienne
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu