Lecture 5: Multi-GPU computing with CUDA and MPI
Tobias Brandvik
The story so far
- Getting started (Pullan)
- An introduction to CUDA for science (Pullan)
- Developing kernels I (Gratton)
- Developing kernels II (Gratton)
- CUDA with multiple GPUs (Brandvik)
- Medical imaging registration (Ansorge)
Agenda
- MPI overview
- The MPI programming model
- Heat conduction example (CPU)
- MPI and CUDA
- Heat conduction example (GPU)
- Performance measurements
MPI overview
- MPI is a specification of a Message Passing Interface
- The specification is a set of functions with prescribed behaviour
- It is not a library: there are multiple competing implementations of the specification
- Two popular open-source implementations are Open-MPI and MPICH2
- Most MPI implementations from vendors are customized versions of these
Why use MPI?
- Performance
- Scalability
- Stability
What hardware does MPI run on?
- Distributed memory clusters
  - MPI's popularity is in large part due to the rise of cheap clusters with commodity x86 nodes over the last 15 years
  - Ethernet or Infiniband interconnects
- Shared memory
  - Some MPI implementations are also suitable for multi-core shared memory machines (e.g. high-end desktops)
MPI programming model
- An MPI program consists of several processes
- Each process can execute different instructions
- Each process has its own memory space
- Processes can only communicate by sending messages to each other
[Figure: two processes (Rank 0 and Rank 1), each with its own CPU and memory, inside a single communicator]

- Rank: a unique integer identifier for a process
- Communicator: the collection of processes which may communicate with each other
A simple example in pseudo-code

We want to copy an array from one processor to another:

rank 0:
    float a[10]; float b[10];
    recv(b, 10, float, 1, 200)
    send(a, 10, float, 1, 300)
    wait()

rank 1:
    float a[10]; float b[10];
    recv(b, 10, float, 0, 300)
    send(a, 10, float, 0, 200)
    wait()

The arguments are, in order: memory location, message length, datatype, source/destination rank, and message tag.
The only 7 MPI functions you'll ever need

- MPI-1 has more than 100 functions
- But most applications only use a small subset of these
- In fact, you can write production code using only 7 MPI functions (though you'll probably use a few more):

    MPI_Init
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Isend
    MPI_Irecv
    MPI_Waitall
    MPI_Finalize

The MPI specification is defined for C, C++ and Fortran; we'll consider the C function prototypes.
A closer look at the functions
int MPI_Init( int *argc, char ***argv )
    Initialises the MPI execution environment

int MPI_Comm_size( MPI_Comm comm, int *size )
    Determines the size of the group associated with a communicator

int MPI_Comm_rank( MPI_Comm comm, int *rank )
    Determines the rank of the calling process in the communicator

int MPI_Finalize()
    Terminates the MPI execution environment
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source,
               int tag, MPI_Comm comm, MPI_Request *request )

- buf: memory location for message
- count: number of elements in message
- datatype: type of elements in message (e.g. MPI_FLOAT)
- source: rank of source
- tag: message tag
- comm: communicator
- request: communication request (used for checking message status)
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request )

- buf: memory location for message
- count: number of elements in message
- datatype: type of elements in message (e.g. MPI_FLOAT)
- dest: rank of destination
- tag: message tag
- comm: communicator
- request: communication request (used for checking message status)
The structure of an MPI program
- Startup
  - MPI_Init
  - MPI_Comm_size / MPI_Comm_rank
  - Read in and initialise data based on the process rank
- Inner loop
  - Post all receives (MPI_Irecv)
  - Post all sends (MPI_Isend)
  - Wait for message passing to finish (MPI_Waitall)
  - Perform computation
- End
  - Write out data
  - MPI_Finalize
An actual MPI program
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Request req_in, req_out;
    MPI_Status stat_in, stat_out;
    float a[10], b[10];
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    if (mpi_rank == 0) {
        MPI_Irecv(b, 10, MPI_FLOAT, 1, 200, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 1, 300, MPI_COMM_WORLD, &req_out);
    }
    if (mpi_rank == 1) {
        MPI_Irecv(b, 10, MPI_FLOAT, 0, 300, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 0, 200, MPI_COMM_WORLD, &req_out);
    }

    MPI_Waitall(1, &req_in, &stat_in);
    MPI_Waitall(1, &req_out, &stat_out);
    MPI_Finalize();
    return 0;
}
Compiling and running MPI programs
- MPI implementations provide wrappers for popular compilers
- These are normally named mpicc / mpicxx / mpif77 etc.
- An MPI program is normally run through: mpirun -np N ./a.out
- So, for the previous example:

    mpicc mpi_example.c
    mpirun -np 2 ./a.out

- These commands are for Open-MPI; other implementations may differ slightly
Heat conduction example (CPU)
We'll modify the heat conduction example from earlier to work with multiple CPUs
2D heat conduction

In 2D:

    \frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}

For which a possible finite difference approximation is:

    \frac{\Delta T}{\Delta t} = \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}

where \Delta T is the temperature change over a time \Delta t and i, j are indices into a uniform structured grid (see next slide).
Stencil
[Figure: five-point stencil on a uniform grid] Update the red (centre) point using data from the blue (neighbour) points and the red point itself.
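To make the update concrete, here is a minimal CPU sketch of the stencil step; the function signature, row-major layout, and the tfac factor are assumptions for illustration, not the lecture's exact code:

/* Minimal sketch: one explicit time step of the 2D heat equation.
 * Assumes a row-major ni x nj grid with dx == dy, and the factor
 * dt/dx^2 folded into tfac. Boundary rows/columns are left untouched.
 */
void step_kernel(int ni, int nj, float tfac,
                 const float *temp_in, float *temp_out)
{
    for (int j = 1; j < nj - 1; j++) {
        for (int i = 1; i < ni - 1; i++) {
            int idx = j * ni + i;
            /* Second differences in x (adjacent columns) and y (adjacent rows) */
            float d2tdx2 = temp_in[idx - 1]  - 2.0f * temp_in[idx] + temp_in[idx + 1];
            float d2tdy2 = temp_in[idx - ni] - 2.0f * temp_in[idx] + temp_in[idx + ni];
            temp_out[idx] = temp_in[idx] + tfac * (d2tdx2 + d2tdy2);
        }
    }
}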
Finding more parallelism
- In the previous lectures, we have tried to find enough parallelism in the problems for 1000s of threads
- This is fine-grained parallelism
- For MPI, we need another level of parallelism on top of this
- This is coarse-grained parallelism
Domain decomposition and halos

[Figure: the 2D grid is split into one subdomain per rank, with an extra row of fictitious boundary nodes along each internal interface]

The fictitious boundary nodes are called halos.
Message passing pattern
- The left-most rank sends data to the right
- The inner ranks send data to both the left and the right
- The right-most rank sends data to the left

[Figure: Rank 0 <-> Rank 1 <-> Rank 2]
Message buffers
- MPI can read and write directly from 2D arrays using an advanced feature called datatypes (but this is complicated and doesn't work for GPUs)
- Instead, we use 1D incoming and outgoing buffers
- The message-passing strategy is then (see the sketch below):
  1. Fill outgoing buffers (2D -> 1D)
  2. Send from outgoing buffers, receive into incoming buffers
  3. Wait
  4. Fill arrays from incoming buffers (1D -> 2D)
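A minimal sketch of the packing and unpacking steps for a decomposition into vertical strips; the function names, buffer names, and row-major layout are assumptions:

/* Pack the first and last interior columns into 1D outgoing buffers. */
void fill_out_buffers(int ni, int nj, const float *temp,
                      float *buf_out_left, float *buf_out_right)
{
    for (int j = 0; j < nj; j++) {
        buf_out_left[j]  = temp[j * ni + 1];        /* first interior column */
        buf_out_right[j] = temp[j * ni + (ni - 2)]; /* last interior column  */
    }
}

/* Unpack the received 1D buffers into the halo columns. */
void empty_in_buffers(int ni, int nj, float *temp,
                      const float *buf_in_left, const float *buf_in_right)
{
    for (int j = 0; j < nj; j++) {
        temp[j * ni]            = buf_in_left[j];   /* left halo column  */
        temp[j * ni + (ni - 1)] = buf_in_right[j];  /* right halo column */
    }
}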
Heat conduction example (single CPU)
for (i = 0; i < nstep; i++) {
    step_kernel();
}
Heat conduction example (multi-CPU)
for (i = 0; i < nstep; i++) {
    fill_out_buffers();
    if (mpi_rank == 0) {                            // left
        receive_right();
        send_right();
    }
    if (mpi_rank > 0 && mpi_rank < mpi_size - 1) {  // inner
        receive_left();
        receive_right();
        send_left();
        send_right();
    }
    if (mpi_rank == mpi_size - 1) {                 // right
        receive_left();
        send_left();
    }
    wait_all();
    empty_in_buffers();
    step_kernel();
}
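Each helper above wraps a single non-blocking MPI call. A sketch of the left-hand pair follows (the right-hand pair mirrors it with tags swapped); the tag values, buffer names, and request bookkeeping are assumptions:

/* Assumed to be declared/defined elsewhere in this sketch (mpi.h included) */
extern float buf_in_left[], buf_out_left[];
extern int nj, mpi_rank;
extern MPI_Request reqs[];
extern int nreqs;

void receive_left(void)
{
    /* Halo values arriving from the rank on our left */
    MPI_Irecv(buf_in_left, nj, MPI_FLOAT, mpi_rank - 1, 100,
              MPI_COMM_WORLD, &reqs[nreqs++]);
}

void send_left(void)
{
    /* Our first interior column, which becomes the left neighbour's halo;
       the neighbour's receive_right must use the matching tag (200). */
    MPI_Isend(buf_out_left, nj, MPI_FLOAT, mpi_rank - 1, 200,
              MPI_COMM_WORLD, &reqs[nreqs++]);
}

void wait_all(void)
{
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    nreqs = 0;
}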
Heat conduction example (multi-GPU)
- How does all this work when we use GPUs?
- Just like with CPUs, except we need buffers on both the CPU and the GPU
- Use one MPI process per GPU (see the device-selection sketch below)
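A common idiom for pairing each rank with a device, assuming ranks are numbered contiguously on each node (a multi-node launch may require a node-local rank instead):

/* Select one GPU per MPI process, after MPI_Init has set mpi_rank. */
int ndev;
cudaGetDeviceCount(&ndev);
cudaSetDevice(mpi_rank % ndev);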
Message buffers with GPUs
The message-passing strategy with GPUs (see the CUDA sketch below):
1. Fill outgoing buffers on the GPU using a kernel (2D -> 1D)
2. Copy buffers to the CPU - cudaMemcpy(DeviceToHost)
3. Send from outgoing buffers, receive into incoming buffers
4. Wait
5. Copy buffers to the GPU - cudaMemcpy(HostToDevice)
6. Fill arrays from incoming buffers on the GPU using a kernel (1D -> 2D)
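A minimal CUDA sketch of step 1, the GPU-side packing kernel, followed by the staging copies of step 2; the kernel name, buffer names, and launch configuration are assumptions:

/* Each thread packs one grid row's boundary values into the 1D buffers. */
__global__ void fill_out_buffers_gpu(int ni, int nj, const float *temp,
                                     float *buf_left, float *buf_right)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < nj) {
        buf_left[j]  = temp[j * ni + 1];
        buf_right[j] = temp[j * ni + (ni - 2)];
    }
}

/* Host side: pack on the GPU, then stage through host buffers for MPI. */
fill_out_buffers_gpu<<<(nj + 255) / 256, 256>>>(ni, nj, d_temp, d_buf_left, d_buf_right);
cudaMemcpy(h_buf_left,  d_buf_left,  nj * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_buf_right, d_buf_right, nj * sizeof(float), cudaMemcpyDeviceToHost);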
Starting from the multi-CPU loop:

for (i = 0; i < nstep; i++) {
    fill_out_buffers_cpu();
    recv();
    send();
    wait();
    empty_in_buffers_cpu();
    step_kernel_cpu();
}
The GPU version adds the device-host copies around the message passing:

for (i = 0; i < nstep; i++) {
    fill_out_buffers_gpu();   // (2D -> 1D)
    cudaMemcpy(DeviceToHost);
    recv();
    send();
    wait();
    cudaMemcpy(HostToDevice);
    empty_in_buffers_gpu();   // (1D -> 2D)
    step_kernel_gpu();
}
Compiling code with CUDA and MPI
- Can use a .cu file and use nvcc like before, but need to include the MPI headers and library:

    nvcc mpi_example.cu -I $HOME/open-mpi/include -L $HOME/open-mpi/lib -lmpi

- Or, compile C code with mpicc and CUDA code with nvcc and link the results together into an executable
- For simple examples, the first approach is fine, but for complicated applications the second approach is cleaner
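The second approach might look like this; the file names and the CUDA runtime library path are assumptions:

# Compile the MPI host code and the CUDA kernels separately, then link.
mpicc -c heat_mpi.c -o heat_mpi.o
nvcc -c heat_kernels.cu -o heat_kernels.o
mpicc heat_mpi.o heat_kernels.o -o heat -L /usr/local/cuda/lib64 -lcudart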
Scaling performance
When benchmarking MPI applications, we look at two issues:
- Strong scaling: how well does the application scale with multiple processors for a fixed problem size?
- Weak scaling: how well does the application scale with multiple processors for a fixed problem size per processor?
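For example, under ideal strong scaling, a run that takes 100 s on one process would take 25 s on four processes; under ideal weak scaling, the runtime would stay at 100 s while the total problem size grows fourfold.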
GPU scaling issues
Achieving good scaling is more difficult with GPUs for two reasons:
1. There is an extra memory copy (cudaMemcpy) involved for every message
2. The kernels are much faster, so the MPI communication becomes a larger fraction of the overall runtime
Typical scaling experience

[Figure: performance against number of processors under weak scaling (left) and strong scaling (right), each showing ideal, CPU, and GPU curves]
Summary
- MPI is a good approach to parallelism on distributed memory machines
- It uses an explicit message-passing model
- Grid problems can be solved in parallel by using halo nodes
- You don't need to change your kernels to use MPI, but you will need to add the message-passing logic
- Using MPI and CUDA together can be done by using both host and device message buffers
- Achieving good scaling is more difficult since the kernels are faster on the GPU