EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS … · Sreeram Potluri, Anshuman Goswami – NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL EFFICIENT BREADTH FIRST SEARCH

Sreeram Potluri, Anshuman Goswami – NVIDIA

Manjunath Gorentla Venkata, Neena Imam - ORNL

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

2

SCOPE OF THE WORK

Reliance on CPU for communication limits scaling on GPU clusters

NVSHMEM provides GPU kernel-side OpenSHMEM API

NVLink is a high-bandwidth memory-like fabric between GPUs

Effectiveness of NVSHMEM over NVLink with Breadth First Search

CPU-initiated MPI vs GPU-initiated SHMEM in a particular implementation of BFS

3

AGENDA

GPU Programming Models: MPI and SHMEM

NVSHMEM Overview

DGX and NVLink

MPI+CUDA implementation of BFS

SHMEM implementation of BFS

Performance Evaluation

Conclusion and Future Work

4

GPU CLUSTER PROGRAMMING Offload model

Compute on GPU

Communication from CPU

Synchronization at boundaries

Overheads on GPU Clusters

Offload latencies

Synchronization overheads

Limits strong scaling

More CPU means more power

void 2dstencil (u, v, …)

{

for (timestep = 0; …) {

interior_compute_kernel <<<…>>> (…)

pack_kernel <<<…>>> (…)

cudaStreamSynchronize(…)

MPI_Irecv(…)

MPI_Isend(…)

MPI_Waitall(…)

unpack_kernel <<<…>>> (…)

boundary_compute_kernel <<<…>>> (…)

…

}

}

5

GPU-CENTRIC COMMUNICATION

GPU capabilities Warp scheduling to hide latencies to global memory Implicit coalescing of loads/stores to achieve efficiency

CUDA helps to program to these Should also benefit when accessing data over the network Direct accesses to remote memory simplifies programming Achieving efficiency while making it easier to program Continuous fine-grained accesses smooths traffic over the network

6

GPU GLOBAL ADDRESS SPACE

Sym

metr

ic H

eap

7

COMMUNICATION FROM CUDA KERNELS

Long running CUDA kernels

Communication within parallelism

__global__ void 2dstencil (u, v, sync, …) { for(timestep = 0; …) { if (i+1 > nx) { v[i+1] = shmem_float_g (v[1], rightpe); } if (i-1 < 1) { v[i-1] = shmem_float_g (v[nx], leftpe); } u[i] = (u[i] + (v[i+1] + v[i-1] . . . if (i < 2) { shmem_int_p (sync + i, 1, peers[i]); shmem_quiet(); shmem_wait_until (sync + i, EQ, 1); } //intra-kernel sync … } }

void 2dstencil (u, v, …)

{

stencil_kernel <<<…>>> (…)

}

8

Implementation of OpenSHMEM Specification (1.3)

Symmetric heap on GPU memory

Implements multi-threading support as being discussed in PR #43

Extensions for GPUs

CUDA stream-based SHMEM communication

Thread cooperative SHMEM communication

Expected availability: Q4 2017 (single OS Multi-GPU nodes: NVLink, PCIe, QPI)

NVSHMEM

HOST/GPU HOST ONLY

Library setup, exit and query

Memory management

shmem_ptr

Remote memory access

Atomic memory operations

Collectives

Point-to-point synchronization

Memory ordering

Locking routines

9

DGX-1 ARCHITECTURE

P100

P100

P100

P100

P100

P100

P100

P100

PLX switch

PLX switch

PLX switch

PLX switch

CPU CPU

IB 100 GB/sec

IB 100 GB/sec

IB 100 GB/sec

IB 100 GB/sec

10

Inter-GPU Copies – Single Process

GPU 0

CPU

GPU 1

cudaSetDevice(0); cudaMalloc(&buf0, size) cudaSetDevice(1); cudaMalloc(&buf1, size) ---------- ---------- cudaMemcpy (buf0, buf1, size, cudaMemcpyDefault )

Unified Virtual Addressing available since CUDA 4.0 (2011)

11

GPUDirect P2P: by-passing CPU memory

cudaSetDevice(0); cudaMalloc(&buf0, size); cudaCanAccessPeer (&access, 0, 1); assert(access == 1); cudaEnablePeerAccess (1, 0); cudaSetDevice(1); cudaMalloc(&buf1, size); ---------- ---------- cudaSetDevice (0); cudaMemcpy (buf0, buf1, size, cudaMemcpyDefault);

cudaDeviceCanAccessPeer(int *access, int device, int peerdevice); cudaEnablePeerAccess(int peerDevice, unsigned int flags)

GPU 0

CPU

GPU 1

GPU 0

CPU

GPU 1

Can also access (LD/ST) the peer GPU memory from inside a CUDA kernel

PCIe P2P (copy/LD/ST)

NVLink (copy/LD/ST

/atomics)

12

cudaIpcGetMemHandle(), cudaIpcOpenMemHandle() & cudaDeviceEnablePeerAccess() Using GPU memory from another process

Physical GPU Memory Existing allocation

Get handle and send to other process

Opening handle creates mapping in receiving

process

H

GPU-0 VAs for Process 1

GPU-1 VAs for Process 2

GPU-0

GPU-1

Available for access by another process AND from a second GPU use peering

13

NVLINK1 W/ PASCAL

0

20

40

60

80

100

120

PCIe 1 NVLink1 4x NVLink1

GB

/se

c

Copy using DMA


0

20

40

60

80

100

120


GB

/se

c

LD/ST - Linear


0

2

4

6

8

10


GB

/se

c

LD/ST - Random Access


1.48x

5.89x

1.35x

5.43x

2.64x

10.44x

• GPU memory model is relaxed and scoped

• Same memory model works within/across GPUs

14

BREADTH FIRST SEARCH (BFS)

Baseline: an MPI-based multi-GPU implementation of BFS by M.Bisson et. al.

Has been shown to scale to thousands of GPUs on the Tsubame supercomputer

Loosely follows the Graph 500 specification

R-MAT generator, mean performance over 64 BFS operations

Uses 32-bit vertices instead of required 48-bit representation

Uses a parallel level-synchronous algorithm

Frontier: list of vertices from which traversal starts in a give level Expand: traverse graph to find un-visited neighbors of frontier vertices

Exchange: exchange list of newly found vertices

Append: remove redundancy and build the frontier for next level

15

DATA PARTITIONING

RC

P0,0

PR-1,0

P1,0

P0,1

PR-1,1

P1,1

P0,C-1

PR-1,C-1

P1,C-1

P0,0

PR-1,0

P1,0

P0,1

PR-1,1

P1,1

P0,C-1

PR-1,C-1

P1,C-1

P0,0

PR-1,0

P1,0

P0,1

PR-1,1

P1,1

P0,C-1

PR-1,C-1

P1,C-1

N/C

0

R-1

2

R

2R-1

R+1

R(C-1)

RC-1

R(C-1)+1

0 1 C-1

0R + i

1R + i

jR + i

(C-1)R + i

Partition at Process Pi,j

Process grid: R rows x C columns

Adjacency list of a vertex is distributed along process column

Destination vertex of each edge is in the same process row

16

BFS USING MPI

frontier list at one processor column is populated with the seed vertex

while (frontier not empty)

{

Expand: local expansion

Horizontal Exchange: Alltoall exchange newly discovered vertices (happens only along grid rows)

Append: build new frontier vertex list

Vertical exchange: Allgather frontier list along grid columns (happens only along columns)

Convergence check: Allreduce frontier list size among all processors

}

17

EXCHANGE USING MPI

Vertex lists are exchanged in two forms:

Using a vertex list (one integer per vertex)

Using a bitmap (one bit per vertex)

Format is statically picked based on the level of traversal

bitmap when number of discovered vertices is expected to be large

vertex list when number of discovered vertices is expected to be large

Communication implemented using non-blocking MPI sends and receives

Enhanced the base version to using CUDA-aware MPI

internally uses CUDA IPC/P2P to take advantage of direct PCIe or NVLink connections

18

0

5

10

15

20

25

20 21 22 23 24 25

GTE

PS

Graph Size (scale)

baseline cuMPI

PERFORMANCE WITH CUDA-AWARE MPI

4 K40 GPUs + PCIe 4 P100 GPUs + NVLink

0

2

4

6

8

20 21 22 23 24 25

GTE

PS

Graph Size (scale)

baseline cuMPI

15% 4%

Supermicro server w/ IvyBridge CPU and Broadcom PCIe switch, CUDA 8.0, MVAPICH2 2.2

DGX-1 w/ Broadwell CPU, CUDA 8.0, MVAPICH2 2.2

19

BFS USING NVSHMEM

Fuse communication into CUDA kernels expand <- exchange of newly discovered vertices append <- exchange frontier vertices while (frontier not empty)

{

Expand on GPU: local expansion

Append on GPU: build new frontier vertex list

Convergence check on CPU: Allreduce frontier list size among all processors

}

atomicOr(sbuf + r[k]/BITS(msk), m[k]);

int peer = r[k] / drow_bl;

uint32_t off = disp + r[k]%drow_bl;

q[k] = shmem_int_or ((int *)(msk + off/BITS(msk)), m[k], peer);

gets rid of the MPI exchange code on the host

20

WORKAROUND FOR PCIE

GPU-GPU atomics are not supported over PCIE We take advantage of fact that 32-bit writes are atomic Vertex list is represented as integers, marked with shmem_int_p Higher processing overhead as the vertex map is 32x larger Provides a common version to compare behavior over PCIe and NVLink

21

PERFORMANCE WITH PUTS

0

2

4

6

8

20 21 22 23 24

GTE

PS

Graph Size (Scale)

cuMPI SHMEM-Put

4 K40 GPUs + PCIe

0

5

10

15

20

25

20 21 22 23 24

GTE

PS

Graph Size (Scale)

cuMPI SHMEM-Put

4 P100 GPUs + NVLink

22

0

10

20

30

40

50

21 22 23 24 25 26

GTE

PS

Graph Size (scale)

cuMPI SHMEM-Atomics

PERFORMANCE WITH ATOMICS

8 P100 GPUs + NVLink (2x4) 8 P100 GPUs + NVLink (4x2)

22%

75%

Overall: 12%

52%

0

10

20

30

40

50

21 22 23 24 25 26 27

GTE

PS

Graph Size (scale)

cuMPI SHMEM-Atomics

23

FUTURE WORK

Fairer comparison by removing tracking of predecessor information Hybrid vertex list/bitmap design like in the case of MPI Can offload multiple levels of BFS using stream-based collectives in NVSHMEM Possibility of kernel fusion, using kernel-side collectives in NVSHMEM Evaluate the use of IB to extend NVSHEM across multiple nodes

24

CONCLUSION

NVLink provides a high-bandwidth memory-like fabric between GPUS NVSHMEM provides GPU-side communication API based on OpenSHMEM We presented a study of using GPU-initiated communication in BFS Benefits from reduced overheads in strong scaling cases, compared to using MPI Tradeoff with efficient network utilization for weak scaling

25

THANKS

We thank authors of the baseline BFS code

Mauro Bisson, Massimo Bernaschi, Enrico Mastrostefano: Parallel Distributed Breadth First Search on the Kepler Architecture

This has been supported in part by Oak Ridge National Lab under subcontract #4000145249.

Documents

EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS … · Sreeram Potluri, Anshuman Goswami – NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL EFFICIENT BREADTH FIRST SEARCH