Lecture 5: Multi-GPU computing with CUDA and MPI
Tobias Brandvik
The story so far
- Getting started (Pullan)
- An introduction to CUDA for science (Pullan)
- Developing kernels I (Gratton)
- Developing kernels II (Gratton)
- CUDA with multiple GPUs (Brandvik)
- Medical imaging registration (Ansorge)
Agenda
- MPI overview
- The MPI programming model
- Heat conduction example (CPU)
- MPI and CUDA
- Heat conduction example (GPU)
- Performance measurements
MPI overview
- MPI is a specification of a Message Passing Interface
- The specification is a set of functions with prescribed behaviour
- It is not a library: there are multiple competing implementations of the specification
- Two popular open-source implementations are Open-MPI and MPICH2
- Most MPI implementations from vendors are customized versions of these
Why use MPI?
- Performance
- Scalability
- Stability
What hardware does MPI run on?
- Distributed memory clusters
  - MPI's popularity is in large part due to the rise of cheap clusters with commodity x86 nodes over the last 15 years
  - Ethernet or Infiniband interconnects
- Shared memory
  - Some MPI implementations are also suitable for multi-core shared memory machines (e.g. high-end desktops)
MPI programming model
- An MPI program consists of several processes
- Each process can execute different instructions
- Each process has its own memory space
- Processes can only communicate by sending messages to each other
[Figure: two processes (Rank 0 and Rank 1), each with its own CPU and memory, inside a single communicator]

- Rank: a unique integer identifier for a process
- Communicator: the collection of processes which may communicate with each other
A simple example in pseudo-code

We want to copy an array from one processor to another:

rank 0:
    float a[10]; float b[10];
    recv(b, 10, float, 1, 200)
    send(a, 10, float, 1, 300)
    wait()

rank 1:
    float a[10]; float b[10];
    recv(b, 10, float, 0, 300)
    send(a, 10, float, 0, 200)
    wait()

The arguments are, in order: memory location, message length, datatype, source/destination rank, and message tag.
The only 7 MPI functions you'll ever need

- MPI-1 has more than 100 functions
- But most applications only use a small subset of these
- In fact, you can write production code using only 7 MPI functions (though you'll probably use a few more):

    MPI_Init
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Isend
    MPI_Irecv
    MPI_Waitall
    MPI_Finalize

The MPI specification is defined for C, C++ and Fortran; we'll consider the C function prototypes.
A closer look at the functions
int MPI_Init( int *argc, char ***argv )
    Initialises the MPI execution environment

int MPI_Comm_size( MPI_Comm comm, int *size )
    Determines the size of the group associated with a communicator

int MPI_Comm_rank( MPI_Comm comm, int *rank )
    Determines the rank of the calling process in the communicator

int MPI_Finalize()
    Terminates the MPI execution environment
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source,
               int tag, MPI_Comm comm, MPI_Request *request )

- buf: memory location for message
- count: number of elements in message
- datatype: type of elements in message (e.g. MPI_FLOAT)
- source: rank of source
- tag: message tag
- comm: communicator
- request: communication request (used for checking message status)
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request )

- buf: memory location for message
- count: number of elements in message
- datatype: type of elements in message (e.g. MPI_FLOAT)
- dest: rank of destination
- tag: message tag
- comm: communicator
- request: communication request (used for checking message status)
The structure of an MPI program
- Startup
  - MPI_Init
  - MPI_Comm_size / MPI_Comm_rank
  - Read in and initialise data based on the process rank
- Inner loop
  - Post all receives (MPI_Irecv)
  - Post all sends (MPI_Isend)
  - Wait for message passing to finish (MPI_Waitall)
  - Perform computation
- End
  - Write out data
  - MPI_Finalize
An actual MPI program
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Request req_in, req_out;
    MPI_Status stat_in, stat_out;
    float a[10], b[10];
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    if (mpi_rank == 0) {
        MPI_Irecv(b, 10, MPI_FLOAT, 1, 200, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 1, 300, MPI_COMM_WORLD, &req_out);
    }
    if (mpi_rank == 1) {
        MPI_Irecv(b, 10, MPI_FLOAT, 0, 300, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 0, 200, MPI_COMM_WORLD, &req_out);
    }

    MPI_Waitall(1, &req_in, &stat_in);
    MPI_Waitall(1, &req_out, &stat_out);
    MPI_Finalize();
    return 0;
}
Compiling and running MPI programs
- MPI implementations provide wrappers for popular compilers
- These are normally named mpicc / mpicxx / mpif77 etc.
- An MPI program is normally run through: mpirun -np N ./a.out
- So, for the previous example:

    mpicc mpi_example.c
    mpirun -np 2 ./a.out

- These commands are for Open-MPI; other implementations may differ slightly
Heat conduction example (CPU)
We'll modify the heat conduction example from earlier to work with multiple CPUs
2D heat conduction

In 2D:

    \frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}

For which a possible finite difference approximation is:

    \frac{\Delta T}{\Delta t} = \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}

where \Delta T is the temperature change over a time \Delta t and i, j are indices into a uniform structured grid (see next slide).
Stencil
[Figure: five-point stencil on a uniform grid] Update the red (centre) point using data from the blue (neighbour) points and the red point itself.
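To make the update concrete, here is a minimal CPU sketch of the stencil step; the function signature, row-major layout, and the tfac factor are assumptions for illustration, not the lecture's exact code:

/* Minimal sketch: one explicit time step of the 2D heat equation.
 * Assumes a row-major ni x nj grid with dx == dy, and the factor
 * dt/dx^2 folded into tfac. Boundary rows/columns are left untouched.
 */
void step_kernel(int ni, int nj, float tfac,
                 const float *temp_in, float *temp_out)
{
    for (int j = 1; j < nj - 1; j++) {
        for (int i = 1; i < ni - 1; i++) {
            int idx = j * ni + i;
            /* Second differences in x (adjacent columns) and y (adjacent rows) */
            float d2tdx2 = temp_in[idx - 1]  - 2.0f * temp_in[idx] + temp_in[idx + 1];
            float d2tdy2 = temp_in[idx - ni] - 2.0f * temp_in[idx] + temp_in[idx + ni];
            temp_out[idx] = temp_in[idx] + tfac * (d2tdx2 + d2tdy2);
        }
    }
}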
Finding more parallelism
- In the previous lectures, we have tried to find enough parallelism in the problems for 1000s of threads
- This is fine-grained parallelism
- For MPI, we need another level of parallelism on top of this
- This is coarse-grained parallelism
Domain decomposition and halos

[Figure: the 2D grid is split into one subdomain per rank, with an extra row of fictitious boundary nodes along each internal interface]

The fictitious boundary nodes are called halos.
Message passing pattern
- The left-most rank sends data to the right
- The inner ranks send data to both the left and the right
- The right-most rank sends data to the left

[Figure: Rank 0 <-> Rank 1 <-> Rank 2]
Message buffers
- MPI can read and write directly from 2D arrays using an advanced feature called datatypes (but this is complicated and doesn't work for GPUs)
- Instead, we use 1D incoming and outgoing buffers
- The message-passing strategy is then (see the sketch below):
  1. Fill outgoing buffers (2D -> 1D)
  2. Send from outgoing buffers, receive into incoming buffers
  3. Wait
  4. Fill arrays from incoming buffers (1D -> 2D)
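A minimal sketch of the packing and unpacking steps for a decomposition into vertical strips; the function names, buffer names, and row-major layout are assumptions:

/* Pack the first and last interior columns into 1D outgoing buffers. */
void fill_out_buffers(int ni, int nj, const float *temp,
                      float *buf_out_left, float *buf_out_right)
{
    for (int j = 0; j < nj; j++) {
        buf_out_left[j]  = temp[j * ni + 1];        /* first interior column */
        buf_out_right[j] = temp[j * ni + (ni - 2)]; /* last interior column  */
    }
}

/* Unpack the received 1D buffers into the halo columns. */
void empty_in_buffers(int ni, int nj, float *temp,
                      const float *buf_in_left, const float *buf_in_right)
{
    for (int j = 0; j < nj; j++) {
        temp[j * ni]            = buf_in_left[j];   /* left halo column  */
        temp[j * ni + (ni - 1)] = buf_in_right[j];  /* right halo column */
    }
}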
Heat conduction example (single CPU)
for (i = 0; i < nstep; i++) {
    step_kernel();
}
Heat conduction example (multi-CPU)
for (i = 0; i < nstep; i++) {
    fill_out_buffers();
    if (mpi_rank == 0) {                            // left
        receive_right();
        send_right();
    }
    if (mpi_rank > 0 && mpi_rank < mpi_size - 1) {  // inner
        receive_left();
        receive_right();
        send_left();
        send_right();
    }
    if (mpi_rank == mpi_size - 1) {                 // right
        receive_left();
        send_left();
    }
    wait_all();
    empty_in_buffers();
    step_kernel();
}
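Each helper above wraps a single non-blocking MPI call. A sketch of the left-hand pair follows (the right-hand pair mirrors it with tags swapped); the tag values, buffer names, and request bookkeeping are assumptions:

/* Assumed to be declared/defined elsewhere in this sketch (mpi.h included) */
extern float buf_in_left[], buf_out_left[];
extern int nj, mpi_rank;
extern MPI_Request reqs[];
extern int nreqs;

void receive_left(void)
{
    /* Halo values arriving from the rank on our left */
    MPI_Irecv(buf_in_left, nj, MPI_FLOAT, mpi_rank - 1, 100,
              MPI_COMM_WORLD, &reqs[nreqs++]);
}

void send_left(void)
{
    /* Our first interior column, which becomes the left neighbour's halo;
       the neighbour's receive_right must use the matching tag (200). */
    MPI_Isend(buf_out_left, nj, MPI_FLOAT, mpi_rank - 1, 200,
              MPI_COMM_WORLD, &reqs[nreqs++]);
}

void wait_all(void)
{
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    nreqs = 0;
}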
Heat conduction example (multi-GPU)
- How does all this work when we use GPUs?
- Just like with CPUs, except we need buffers on both the CPU and the GPU
- Use one MPI process per GPU (see the device-selection sketch below)
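A common idiom for pairing each rank with a device, assuming ranks are numbered contiguously on each node (a multi-node launch may require a node-local rank instead):

/* Select one GPU per MPI process, after MPI_Init has set mpi_rank. */
int ndev;
cudaGetDeviceCount(&ndev);
cudaSetDevice(mpi_rank % ndev);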
Message buffers with GPUs
The message-passing strategy with GPUs (see the CUDA sketch below):
1. Fill outgoing buffers on the GPU using a kernel (2D -> 1D)
2. Copy buffers to the CPU - cudaMemcpy(DeviceToHost)
3. Send from outgoing buffers, receive into incoming buffers
4. Wait
5. Copy buffers to the GPU - cudaMemcpy(HostToDevice)
6. Fill arrays from incoming buffers on the GPU using a kernel (1D -> 2D)
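A minimal CUDA sketch of step 1, the GPU-side packing kernel, followed by the staging copies of step 2; the kernel name, buffer names, and launch configuration are assumptions:

/* Each thread packs one grid row's boundary values into the 1D buffers. */
__global__ void fill_out_buffers_gpu(int ni, int nj, const float *temp,
                                     float *buf_left, float *buf_right)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < nj) {
        buf_left[j]  = temp[j * ni + 1];
        buf_right[j] = temp[j * ni + (ni - 2)];
    }
}

/* Host side: pack on the GPU, then stage through host buffers for MPI. */
fill_out_buffers_gpu<<<(nj + 255) / 256, 256>>>(ni, nj, d_temp, d_buf_left, d_buf_right);
cudaMemcpy(h_buf_left,  d_buf_left,  nj * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_buf_right, d_buf_right, nj * sizeof(float), cudaMemcpyDeviceToHost);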
Starting from the multi-CPU loop:

for (i = 0; i < nstep; i++) {
    fill_out_buffers_cpu();
    recv();
    send();
    wait();
    empty_in_buffers_cpu();
    step_kernel_cpu();
}
The GPU version adds the device-host copies around the message passing:

for (i = 0; i < nstep; i++) {
    fill_out_buffers_gpu();   // (2D -> 1D)
    cudaMemcpy(DeviceToHost);
    recv();
    send();
    wait();
    cudaMemcpy(HostToDevice);
    empty_in_buffers_gpu();   // (1D -> 2D)
    step_kernel_gpu();
}
Compiling code with CUDA and MPI
- Can use a .cu file and use nvcc like before, but need to include the MPI headers and library:

    nvcc mpi_example.cu -I $HOME/open-mpi/include -L $HOME/open-mpi/lib -lmpi

- Or, compile C code with mpicc and CUDA code with nvcc and link the results together into an executable
- For simple examples, the first approach is fine, but for complicated applications the second approach is cleaner
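The second approach might look like this; the file names and the CUDA runtime library path are assumptions:

# Compile the MPI host code and the CUDA kernels separately, then link.
mpicc -c heat_mpi.c -o heat_mpi.o
nvcc -c heat_kernels.cu -o heat_kernels.o
mpicc heat_mpi.o heat_kernels.o -o heat -L /usr/local/cuda/lib64 -lcudart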
Scaling performance
When benchmarking MPI applications, we look at two issues:
- Strong scaling: how well does the application scale with multiple processors for a fixed problem size?
- Weak scaling: how well does the application scale with multiple processors for a fixed problem size per processor?
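For example, under ideal strong scaling, a run that takes 100 s on one process would take 25 s on four processes; under ideal weak scaling, the runtime would stay at 100 s while the total problem size grows fourfold.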
GPU scaling issues
Achieving good scaling is more difficult with GPUs for two reasons:
1. There is an extra memory copy (cudaMemcpy) involved for every message
2. The kernels are much faster, so the MPI communication becomes a larger fraction of the overall runtime
Typical scaling experience

[Figure: performance against number of processors under weak scaling (left) and strong scaling (right), each showing ideal, CPU, and GPU curves]
Summary
- MPI is a good approach to parallelism on distributed memory machines
- It uses an explicit message-passing model
- Grid problems can be solved in parallel by using halo nodes
- You don't need to change your kernels to use MPI, but you will need to add the message-passing logic
- Using MPI and CUDA together can be done by using both host and device message buffers
- Achieving good scaling is more difficult since the kernels are faster on the GPU