The Future of MPI

A Presentation at HPC Advisory Council Workshop, Malaga 2012

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
High-End Computing (HEC): PetaFlop to ExaFlop

[Figure: projected performance growth: 100 PFlops expected in 2015 and 1 EFlops in 2019]

An ExaFlop system is expected in 2019-2021!
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%)
Jun. 1997: 1/500 (0.2%)
Nov. 1997: 1/500 (0.2%)
Jun. 1998: 1/500 (0.2%)
Nov. 1998: 2/500 (0.4%)
Jun. 1999: 6/500 (1.2%)
Nov. 1999: 7/500 (1.4%)
Jun. 2000: 11/500 (2.2%)
Nov. 2000: 28/500 (5.6%)
Jun. 2001: 33/500 (6.6%)
Nov. 2001: 43/500 (8.6%)
Jun. 2002: 80/500 (16%)
Nov. 2002: 93/500 (18.6%)
Jun. 2003: 149/500 (29.8%)
Nov. 2003: 208/500 (41.6%)
Jun. 2004: 291/500 (58.2%)
Nov. 2004: 294/500 (58.8%)
Jun. 2005: 304/500 (60.8%)
Nov. 2005: 360/500 (72.0%)
Jun. 2006: 364/500 (72.8%)
Nov. 2006: 361/500 (72.2%)
Jun. 2007: 373/500 (74.6%)
Nov. 2007: 406/500 (81.2%)
Jun. 2008: 400/500 (80.0%)
Nov. 2008: 410/500 (82.0%)
Jun. 2009: 410/500 (82.0%)
Nov. 2009: 417/500 (83.4%)
Jun. 2010: 424/500 (84.8%)
Nov. 2010: 415/500 (83%)
Jun. 2011: 411/500 (82.2%)
Nov. 2011: 410/500 (82.0%)
Jun. 2012: 407/500 (81.4%)
Large-scale InfiniBand Installations

• 209 IB clusters (41.8%) in the June 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (13 systems):

147,456 cores (SuperMUC) in Germany (4th)
77,184 cores (Curie thin nodes) at France/CEA (9th)
120,640 cores (Nebulae) at China/NSCS (10th)
125,980 cores (Pleiades) at NASA/Ames (11th)
70,560 cores (Helios) at Japan/IFERC (12th)
73,278 cores (Tsubame 2.0) at Japan/GSIC (14th)
138,368 cores (Tera-100) at France/CEA (17th)
122,400 cores (Roadrunner) at LANL (19th)
78,660 cores (Lomonosov) in Russia (22nd)
137,200 cores (Sunway Blue Light) in China (26th)
46,208 cores (Zin) at LLNL (27th)
29,440 cores (Mole-8.5) in China (37th)
42,440 cores (Red Sky) at Sandia (39th)
More are getting installed!
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
Parallel Programming Models Overview

[Diagram: three abstract machine models: the Shared Memory Model (SHMEM, DSM), where processes P1-P3 share one memory; the Distributed Memory Model (MPI: Message Passing Interface), where each process has its own memory; and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …), which presents a logical shared memory over distributed memories]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation series, we concentrate on MPI first and
then focus on PGAS
MPI Overview and History

• Message passing library standardized by the MPI Forum
  – C, C++ and Fortran
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1997
  – MPI-3: ongoing effort (2008 – current)
Major MPI Features

• Point-to-point two-sided communication
• Collective communication
• One-sided communication
• Job startup
• Parallel I/O
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: software stack: Applications/Libraries on top of Programming Models (Message Passing Interface (MPI), Sockets and PGAS (UPC, OpenSHMEM, Global Arrays)); beneath them a Library or Runtime for Programming Models providing point-to-point communication, QoS, collective communication, synchronization & locks, I/O & file systems, and fault tolerance; all running over Networking Technologies (InfiniBand, 1/10/40GigE, RoCE & intelligent NICs) and Commodity Computing System Architectures (single, dual, quad, .. sockets; multi/many-core architectures and accelerators)]
Exascale System Targets

Systems                     2010       2018                                               Difference (2010 vs. 2018)
System peak                 2 PFlop/s  1 EFlop/s                                          O(1,000)
Power                       6 MW       ~20 MW (goal)
System memory               0.3 PB     32-64 PB                                           O(100)
Node performance            125 GF     1.2 or 15 TF                                       O(10) - O(100)
Node memory BW              25 GB/s    2-4 TB/s                                           O(100)
Node concurrency            12         O(1k) or O(10k)                                    O(100) - O(1,000)
Total node interconnect BW  3.5 GB/s   200-400 GB/s (1:4 or 1:8 from memory BW)           O(100)
System size (nodes)         18,700     O(100,000) or O(1M)                                O(10) - O(100)
Total concurrency           225,000    O(billion) + [O(10) to O(100) for latency hiding]  O(10,000)
Storage capacity            15 PB      500-1000 PB (>10x system memory is min)            O(10) - O(100)
IO rates                    0.2 TB/s   60 TB/s                                            O(100)
MTTI                        Days       O(1 day)                                           -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra
Basic Design Challenges for Exascale Systems

• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
  – Hard to meet the power requirements of data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rates
• Concurrency and Locality Challenge
  – Management of very large amounts of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults
How does MPI plan to meet these challenges?

• Power required for data movement operations is one of the main challenges
• Non-blocking collectives
  – Overlap computation and communication
• Much-improved one-sided interface
  – Reduce synchronization between sender and receiver
• Manage concurrency
  – Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM)
• Resiliency
  – New interface for detecting failures
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
Major New Features in MPI-3

• Major features
  – Non-blocking Collectives
  – Improved One-Sided (RMA) Model
  – MPI Tools Interface
• Draft version available: http://meetings.mpi-forum.org/MPI_3.0_main_page.php
• Targeted to be released during the next few weeks
Non-blocking Collective Operations

• Enable overlap of computation with communication
• Remove the synchronization effects of collective operations (with the exception of barrier)
• Completion when the local part of the operation is complete
• Completion of one non-blocking collective does not imply completion of other non-blocking collectives
• No "tag" for the collective operation
• Issuing many non-blocking collectives may exhaust resources
  – Quality implementations will ensure that this happens only in pathological cases
Non-blocking Collective Operations (cont'd)

• Non-blocking calls do not match blocking collective calls
  – The MPI implementation may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• Users must call collectives in the same order on all ranks
• Progress rules are the same as those for point-to-point operations
• Example new calls: MPIX_Ibarrier, MPIX_Iallreduce, …
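To make the overlap concrete, here is a minimal sketch of a non-blocking reduction. It assumes an MPI-3-capable library; draft implementations exposed these calls under the MPIX_ prefix listed above, while the code below uses the final MPI_ names. The loop standing in for independent computation is illustrative only.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction; the call returns immediately. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Independent computation, overlapped with the collective. */
    double work = 0.0;
    for (int i = 1; i <= 1000000; i++)
        work += 1.0 / i;

    /* The result in 'global' is valid only after completion. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum = %.1f (local work = %.4f)\n", global, work);
    MPI_Finalize();
    return 0;
}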
Recap of One-sided Communication

• Easy to express irregular patterns of communication
  – Easier than a request-response pattern using two-sided communication
• Decouples data transfer from synchronization
• Better overlap of communication with computation

[Diagram: four ranks (Rank 0 to Rank 3), each exposing part of its local memory in a common access "window" for Remote Memory Access]
Improved One-sided Model

• Remote Memory Access (RMA)
• The new proposal has major improvements
• MPI-2: public and private windows
  – Synchronization of windows is explicit
• MPI-2: works for non-cache-coherent systems
• MPI-3: two types of windows
  – Unified and Separate
  – The Unified window leverages hardware cache coherence

[Diagram: separate window model: a processor's private window, updated by local memory operations, is synchronized explicitly with the public window, which receives incoming RMA operations]
Highlights of New MPI-3 RMA Calls

• MPI_Win_create_dynamic
  – Window without memory attached
  – MPI_Win_attach to attach memory to a window
• MPI_Win_allocate_shared
  – Windows with shared memory
  – Allows direct load/store accesses by remote processes
• MPI_Rput, MPI_Rget, MPI_Raccumulate
  – Local completion by using MPI_Wait on request objects
• MPI_Get_accumulate, MPI_Fetch_and_op
  – Accumulate into target memory, return old data to origin
• MPI_Compare_and_swap
  – Atomic compare-and-swap
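As an illustration of the new window types, the sketch below allocates a shared-memory window and reads a neighbor's slot with a plain load. It assumes all ranks run on one node; in a real program one would first split the communicator with MPI_Comm_split_type(…, MPI_COMM_TYPE_SHARED, …). The sync/barrier/sync pattern is one common way to order the direct stores and loads.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One int per process, all slots in one shared segment. */
    int *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    *mine = rank * 10;            /* direct store into my slot */
    MPI_Win_sync(win);            /* make the store visible */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_sync(win);

    /* Query the base address of the right neighbor's slot and
       read it with a direct load -- no Put/Get needed. */
    int *nbr;
    MPI_Aint sz;
    int disp;
    MPI_Win_shared_query(win, (rank + 1) % size, &sz, &disp, &nbr);
    printf("rank %d sees neighbor value %d\n", rank, *nbr);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}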
Highlights of New MPI-3 RMA Calls (cont'd)

• MPI_Win_lock_all
  – Faster way to lock all members of the window
• MPI_Win_flush / MPI_Win_flush_all
  – Flush all RMA ops to a target / to the whole window
• MPI_Win_flush_local / MPI_Win_flush_local_all
  – Locally complete RMA ops to a target / to the whole window
• MPI_Win_sync
  – Synchronize public and private copies of the window
• Overlapping accesses were "erroneous" in MPI-2
  – They are "undefined" in MPI-3
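A minimal sketch of the passive-target style these calls enable: one epoch around many operations, with a per-target flush instead of a lock/unlock pair per operation. The window setup with MPI_Win_allocate and the ring pattern are illustrative assumptions (a unified window model is assumed for the final local read).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1.0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Single epoch covering all targets. */
    MPI_Win_lock_all(0, win);

    int target = (rank + 1) % size;
    double val = (double)rank;
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    /* Remotely complete just this target's operations; the epoch
       stays open for further RMA to any target. */
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d received %.1f\n", rank, *buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}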
MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPIT) to access MPI-internal information
  • Configuration and control information
    • Eager limit, buffer sizes, …
  • Performance information
    • Time spent in blocking calls, memory usage, …
  • Debugging information
    • Packet counters, thresholds, …
• External tools can build on top of this standard interface
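The draft "MPIT" naming above became the MPI_T routine prefix in the final standard. A small sketch that enumerates whatever control variables an implementation exposes (eager limits, buffer sizes, and so on) could look like this; which variables appear is entirely implementation-defined.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    /* How many control variables does this MPI expose? */
    MPI_T_cvar_get_num(&ncvars);

    for (int i = 0; i < ncvars; i++) {
        char name[256], desc[256];
        int nlen = sizeof(name), dlen = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &nlen, &verbosity, &dtype,
                            &enumtype, desc, &dlen, &binding, &scope);
        printf("%s : %s\n", name, desc);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}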
Miscellaneous Features

• New Fortran bindings
• Non-blocking file I/O
• C++ bindings removed from the standard
• Support for large data counts
• Scalable sparse collectives on process topologies
• Distributed graphs
• Topology-aware communicator creation
• Non-collective (group-collective) communicator creation
• Enhanced datatype support: MPI_TYPE_CREATE_HINDEXED_BLOCK
• Mprobe for hybrid programming on clusters of SMP nodes (see the sketch below)
• Numerous fixes in the existing proposal
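The matched probe added for hybrid programs lets a thread probe and then receive the same message atomically, so another thread cannot "steal" it between the probe and the receive. A minimal two-rank sketch using the MPI-3 names:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload[4] = { 1, 2, 3, 4 };
        MPI_Send(payload, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Mprobe returns a message handle bound to one message. */
        MPI_Message msg;
        MPI_Status st;
        int count;
        MPI_Mprobe(0, 0, MPI_COMM_WORLD, &msg, &st);
        MPI_Get_count(&st, MPI_INT, &count);
        int *buf = malloc(count * sizeof(int));
        /* Mrecv consumes exactly the probed message. */
        MPI_Mrecv(buf, count, MPI_INT, &msg, &st);
        printf("received %d ints\n", count);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}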
Presentation Overview

• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Overview of new MPI-3 features
• Addressing Exascale Challenges in MVAPICH2 Project
• Conclusions
MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 1,960 organizations (HPC centers, industry and universities) in 70 countries
  – More than 129,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 11th-ranked 125,980-core cluster (Pleiades) at NASA
    • 14th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 40th-ranked 62,976-core cluster (Ranger) at TACC
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system
Challenges being Addressed by MVAPICH2 for Exascale

• Scalability for million to billion processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + OpenSHMEM, MPI + UPC, …)
• Balancing intra-node and inter-node communication for next-generation multi-cores (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and accelerators (Intel MIC)
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• MPI-3 support will be available soon
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for the MPI-3 RMA model
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
One-way Latency: MPI over IB

[Figure: small-message and large-message one-way latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR; small-message latencies range from 0.99 us to 1.82 us across the five configurations]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Bandwidth: MPI over IB

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same five configurations; peak unidirectional bandwidths of 1706, 1917, 3280, 3385 and 6343 MBytes/sec, and peak bidirectional bandwidths of 3341, 3704, 4407, 6521 and 11643 MBytes/sec]

DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Hybrid Transport Design (UD/RC/XRC)

• Both UD and RC/XRC have benefits
• Evaluate the characteristics of all of them and use two sets of transports in the same application – get the best of both
• Available in MVAPICH (since 1.1) as a separate interface
• Available since MVAPICH2 1.7 as an integrated interface

M. Koop, T. Jones and D. K. Panda, "MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand," IPDPS '08
HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)

• Experimental platform: LLNL Hyperion cluster (Intel Harpertown + QDR IB)
• RC: default parameters
• UD: MV2_USE_UD=1
• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1

[Figure: HPCC Random Ring latency (us) for 128, 256, 512 and 1024 processes with the UD, Hybrid and RC transports; annotated improvements of 38%, 26%, 40% and 30% for the Hybrid mode]
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support)

[Figure: latency, bandwidth and bi-directional bandwidth vs. message size (bytes) for intra-socket and inter-socket communication, with shared memory (Shmem) and kernel-based zero-copy (LiMIC), using the latest MVAPICH2 1.8 on Intel Westmere; latencies of 0.19 us (intra-socket) and 0.45 us (inter-socket), peak bandwidth of about 10,000 MB/s and bi-directional bandwidth of about 18,500 MB/s]
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• MPI-3 RMA support
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Shared-memory Aware Collectives

[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) at 4,096 cores, original vs. shared-memory designs; and Barrier latency (us) for 128-1024 processes, point-to-point vs. shared-memory barrier]

• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
• Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1, MV2_USE_SHMEM_BARRIER=0/1
Hardware Multicast-aware MPI_Bcast

[Figure: 16-byte and 64 KB MPI_Bcast time (us) for 32-1024 processes, and small-message (1 B-1 KB) and large-message (2 KB-512 KB) MPI_Bcast time at 1,024 cores, default vs. hardware-multicast-based designs]

Available in MVAPICH2 1.9
Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs

[Diagram: spine and line-card switches connecting racks of nodes with busy cores, idle cores and a newly arriving job]
Non-contiguous Allocation of Jobs (cont'd)

• Individual processes belonging to one job can get scattered across the system
• The primary responsibility of the scheduler is to keep system throughput high

[Diagram: the new job's processes placed on idle cores scattered across racks]
Impact of Network-Topology-Aware Algorithms on Broadcast Performance

[Figure: broadcast latency (ms) vs. message size (128K-1M bytes) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT; and normalized latency vs. job size (128-1024 processes)]

• Impact of topology-aware schemes is more pronounced as
  • message size increases
  • system size scales
• Up to 14% improvement in performance at scale

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes, and default vs. topology-aware placement for a 2048-core run (15% improvement)]

• Reduced network topology discovery time from O(Nhosts^2) to O(Nhosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload

[Figure: normalized performance of HPL-Offload, HPL-1ring and HPL-Host vs. HPL problem size (N) as % of total memory, plus application results for P3DFFT and a pre-conjugate-gradient solver]

• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified pre-conjugate-gradient solver with Offload-Allreduce performs up to 21.8% better than the default version

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Non-blocking Neighborhood Collectives to Minimize Communication Overheads of Irregular Graph Algorithms

• Critical to improve the communication overheads of irregular graph algorithms to cater to the graph-theoretic demands of emerging "big-data" applications
• What are the challenges involved in re-designing irregular graph algorithms, such as 2D BFS, to achieve fine-grained computation/communication overlap?
• Can we design network-offload-based non-blocking neighborhood collectives in a high-performance, scalable manner?

[Figure: CombBLAS (2D BFS) application run-time on Hyperion (Scale=29), and communication-overhead breakdown with 1936 processes on Hyperion (annotated: 70%, 20% and 6.5%)]

K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload-based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, accepted at Cluster (IWPAPS) 2012
Power and Energy Savings with Power-Aware Collectives

[Figure: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings (annotated: 5% and 32%)]

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10
Overview of MVAPICH2 Designs

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Sample Code – Without MPI Integration

[Diagram: data path from GPU through CPU host memory to the NIC and switch over PCIe]

At Sender:
  cudaMemcpy(sbuf, sdev, . . .);
  MPI_Send(sbuf, size, . . .);

At Receiver:
  MPI_Recv(rbuf, size, . . .);
  cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA
• High productivity but poor performance
Sample Code – User-Optimized Code

[Diagram: pipelined data path from GPU through CPU host memory to the NIC and switch over PCIe]

At Sender (and repeated at the receiver side):
  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, blksz, . . .);
  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(. . .);
          if (j > 0) MPI_Test(. . .);
      }
      MPI_Isend(sbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall(. . .);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• User-level copying may not match the internal MPI design
• High performance but poor productivity
Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside the MPI library, without user tuning
• A new design incorporated in MVAPICH2 to support this functionality
Sample Code – MVAPICH2-GPU

At Sender:
  MPI_Send(s_device, size, . . .);

At Receiver:
  MPI_Recv(r_device, size, . . .);

(Pipelining is handled inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (CUDA >= 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
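Put together as a complete example, a run with a CUDA-aware MVAPICH2 could look like the sketch below: device pointers are passed straight to MPI, which detects them through UVA and pipelines the staging internally. The buffer size and the two-rank pattern are illustrative assumptions.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *dev;
    cudaMalloc((void **)&dev, n * sizeof(double));

    if (rank == 0) {
        /* Device pointer goes directly into MPI_Send; the library
           handles the device-host staging (or IPC/RDMA paths). */
        MPI_Send(dev, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into device memory\n", n);
    }

    cudaFree(dev);
    MPI_Finalize();
    return 0;
}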
MPI-Level Two-sided Communication

[Figure: GPU-to-GPU transfer time (us) vs. message size (32 KB-4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11
Other MPI Operations and Optimizations for GPU Buffers

• Overlap optimizations for
  – One-sided communication
  – Collectives
  – Communication with datatypes
• Optimized designs for multiple GPUs per node
  – Use CUDA IPC (in CUDA 4.1) to avoid copies through host memory

• H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
• A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
• S. Potluri et al., Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems (ASHES), held in conjunction with IPDPS 2012, May 2012
MVAPICH2 1.8 and 1.9a Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Applications-Level Evaluation

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  • Lattice Boltzmann Method for multiphase flows with large density ratios
  • One process/GPU per node, 16 nodes
• AWP-ODC (courtesy: Yifeng Cui, SDSC)
  • A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  • 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070s, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Figure: LBM-CUDA step time (s) for domain sizes 256x256x256 through 512x512x512, MPI vs. MPI-GPU, with 13.7%, 12.0%, 11.8% and 9.4% improvements; AWP-ODC total execution time (s) with 1 and 2 GPUs/processes per node, MPI vs. MPI-GPU, with 11.1% and 7.9% improvements]
Many Integrated Core (MIC) Architecture

• Intel's Many Integrated Core (MIC) architecture is geared for HPC
• Critical part of their solution for exascale computing
• Knights Ferry (KNF) and Knights Corner (KNC)
• Many low-power x86 processor cores with hardware threads and wider vector units
• Applications and libraries can run with minor or no modifications
• Using MVAPICH2 for communication within a KNF coprocessor
  – out-of-the-box (Default)
  – shared-memory designs tuned for KNF (Optimized V1)
  – designs enhanced using a lower-level API (Optimized V2)
Point-to-Point Communication (Intra-MIC)

[Figure: normalized latency and normalized bandwidth vs. message size (bytes) for the Default, Optimized V1 and Optimized V2 designs within a KNF coprocessor; 13-17% latency improvement for small messages and 70-80% for large messages; 15% bandwidth improvement for small messages and up to 4X-9.5X for large messages]

Knights Ferry D0 1.20 GHz card, 32 cores; Alpha 9 MIC software stack with an additional pre-alpha patch
Collective Communication (Intra-MIC)

[Figure: normalized latency vs. message size (bytes) for Broadcast (small and large messages) and Scatter with the Default, Optimized V1 and Optimized V2 designs; improvements of 17% and 20-26% for Broadcast and 72-80% for Scatter]

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 – Best Student Paper Award
MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Support for MPI-3 RMA Model

MPI-3 one-sided communication at a glance:
• Window creation: Win_allocate; Win_allocate_shared; Win_create_dynamic with Win_attach and Win_detach
• Window models: Separate and Unified
• Synchronization: Lock_all and Unlock_all; Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all; Win_sync
• Communication: Get_accumulate, Fetch_and_op, Compare_and_swap; Rput, Rget, Raccumulate, Rget_accumulate
• Accumulate ordering; conflicting accesses are undefined
Support for MPI-3 RMA Model (cont'd)

• Flush operations
  – Local and remote completions are bundled in MPI-2
  – Considerable overhead on networks like IB, where the semantics and cost of local and remote completions differ
  – Flush operations allow a more efficient check for completions
• Request-based operations
  – Current semantics provide bulk synchronization
  – Request-based operations return an MPI request that can be polled individually
  – Allow for better overlap with the direct one-sided design in MVAPICH2
• Dynamic windows
  – Allow users to attach and detach memory dynamically; the key is to hide the overheads (such as the RDMA key exchange)
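A sketch of the dynamic-window usage described above, using the MPI-3 names; publishing the attached addresses via MPI_Get_address and MPI_Allgather is one common pattern, since dynamic windows address remote memory by absolute address.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Create a window with no memory attached. */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Attach a local buffer later, on demand. */
    double *buf = malloc(sizeof(double));
    *buf = (double)rank;
    MPI_Win_attach(win, buf, sizeof(double));

    /* Every rank publishes the absolute address of its buffer. */
    MPI_Aint addr, *addrs = malloc(size * sizeof(MPI_Aint));
    MPI_Get_address(buf, &addr);
    MPI_Allgather(&addr, 1, MPI_AINT, addrs, 1, MPI_AINT,
                  MPI_COMM_WORLD);

    int target = (rank + 1) % size;
    double val;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&val, 1, MPI_DOUBLE, target, addrs[target], 1,
            MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);
    printf("rank %d read %.1f\n", rank, val);

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);
    free(addrs);
    MPI_Finalize();
    return 0;
}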
[Figure: time (usec) vs. message size (1 B-4 KB) for Put+Unlock, Put+Flush_local and Put+Flush; and percentage overlap vs. message size (2 KB-2 MB) for Lock-Unlock vs. request-based operations]

• Flush_local allows a faster check on local completions
• Request-based operations provide much better overlap than Lock-Unlock in a Get-Compute-Put communication pattern

S. Potluri, S. Sur, D. Bureddy and D. K. Panda, Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand, EuroMPI 2011, September 2011
MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transports
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, UD-multicast-based, topology-aware and power-aware
  – Exploiting collective offload for non-blocking collectives
• Support for GPGPUs and accelerators
• Support for MPI-3 RMA
• PGAS support (hybrid MPI + UPC, hybrid MPI + OpenSHMEM)
• QoS support with virtual lanes
• Fault tolerance and fault resilience
Different Approaches for Supporting PGAS Models

• Library-based
  – Global Arrays
  – OpenSHMEM
• Compiler-based
  – Unified Parallel C (UPC)
  – Co-Array Fortran (CAF)
• HPCS language-based
  – X10
  – Chapel
  – Fortress
Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) Designs

• Based on the OpenSHMEM reference implementation (http://openshmem.org/)
  • Provides a design over GASNet
  • Does not take advantage of all OFED features
• Design a scalable, high-performance OpenSHMEM over OFED
• Designing a hybrid MPI + OpenSHMEM model
  • Current model: separate runtimes for OpenSHMEM and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  • Our approach: a single runtime for MPI and OpenSHMEM

[Diagram: a hybrid (OpenSHMEM+MPI) application issuing OpenSHMEM calls and MPI calls through the OpenSHMEM and MPI interfaces of a single OSU design over the InfiniBand network]

Available in MVAPICH2-X
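A toy sketch of the hybrid style this enables: OpenSHMEM one-sided accesses mixed with MPI collectives in one program over a single runtime. The initialization order shown (start_pes before MPI calls) is an assumption here; consult the MVAPICH2-X documentation for the supported sequence.

#include <shmem.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    static int src;               /* symmetric data object */
    start_pes(0);                 /* OpenSHMEM initialization */
    MPI_Init(&argc, &argv);       /* assumption: the unified runtime
                                     accepts both initializations */

    int pe = _my_pe(), npes = _num_pes();
    src = pe;
    shmem_barrier_all();

    /* One-sided get from the right neighbor via OpenSHMEM. */
    int val;
    shmem_int_get(&val, &src, 1, (pe + 1) % npes);

    /* Global reduction via MPI on the same processes. */
    int sum;
    MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (pe == 0) printf("sum of neighbor ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}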
Micro-Benchmark Performance (OpenSHMEM)

[Figure: OpenSHMEM-GASNet vs. OpenSHMEM-OSU: atomic operation latency (us) for fadd, finc, cswap, swap, add and inc, and latency (ms) of Broadcast, Collect and Reduce with 256-byte messages at 32-256 processes; annotated improvements of 41%, 16%, 35% and 36% for the OSU design]
Performance of Hybrid (OpenSHMEM+MPI) Applications

[Figure: execution time (s) vs. number of processes for 2D Heat Transfer Modeling (32-512 processes) and Graph-500 (32-256 processes), plus network resource utilization (MB), Hybrid-GASNet vs. Hybrid-OSU]

• Improved performance for hybrid applications
  • 34% improvement for 2D Heat Transfer Modeling with 512 processes
  • 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Evaluation of Hybrid MPI+UPC NAS-FT

[Figure: NAS FT execution time (sec) for classes B and C at 64 and 128 processes, comparing GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid version]

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• 34% improvement for FT (Class C, 128 processes)

Hybrid MPI + UPC support will be available in a future MVAPICH2-X release.
MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
  – Extending the GPGPU support (GPU-Direct RDMA)
  – Support for Intel MIC
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-restart and migration support with in-memory checkpointing
• QoS-aware I/O and checkpointing
Conclusions

• Presented challenges in designing programming models for Exascale systems
• Overview of MPI and MPI-3 features
• How MVAPICH2 is addressing some of these challenges
• MPI-3 plans for MVAPICH2
• MVAPICH2 plans for Exascale systems
Funding Acknowledgments

Funding Support by: [sponsor logos]
Equipment Support by: [vendor logos]
Personnel Acknowledgments

Current Students: J. Chen (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), M. Li (Ph.D.), M. Luo (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), H. Subramoni (Ph.D.), A. Venkatesh (Ph.D.)

Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur
Current Post-Docs: X. Lu, J. Vienne, H. Wang
Past Post-Docs: X. Besseron, H.-W. Jin, E. Mancini, S. Marcarelli
Current Programmers: M. Arnold, D. Bureddy, J. Perkins
Web Pointers

http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu