

CS 770G - Parallel Algorithms in Scientific Computing

Lecture 3 - May 14, 2001

Message-Passing II: MPI Programming

2

References

• Gropp, Lusk & Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press.

• Pacheco, Parallel Programming with MPI, Morgan Kaufmann.

3

Message Passing Programming

• Separate processors.
• Separate address spaces.
• Procs execute independently and concurrently.
• Procs transfer data cooperatively.
• Single Program Multiple Data (SPMD)
  – All procs are executing the same program, but operating on different data.
• Multiple Program Multiple Data (MPMD)
  – Different procs may be executing different programs.

• Common software tools: PVM, MPI.

4

What is MPI? Why?

• A message-passing library specification
  – Message-passing model
  – Not a compiler specification
  – Not a specific product
• For parallel computers, clusters & heterogeneous networks.
• Designed to permit the development of parallel software libraries.
• Designed to provide access to advanced parallel hardware for:
  – End users
  – Library writers
  – Tool developers

5

Who Designed MPI?

• Broad participation.
• Vendors:
  – IBM, Intel, TMC, Meiko, Cray, Convex, nCube.
• Library writers:
  – PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda.
• Application specialists and consultants:
  – Companies: ARCO, KAI, NAG, Parasoft, Shell, …
  – Labs: ANL, LANL, LLNL, ORNL, SNL, …
  – Universities: almost 20.

6

Why Use MPI?

• Standardization
  – The only message-passing library that can be considered a standard.
• Portability
  – No need to modify the source when porting code from one MPI-supporting platform to another.
• Performance
  – Vendor implementations should be able to exploit native hardware to optimize performance.

7

Why Use MPI? (cont.)

• Availability
  – A variety of implementations are available, both vendor-supplied and public domain, e.g. MPICH from ANL.
• Functionality
  – It provides more than 100 subroutine calls.

8

Features of MPI

• General
  – Communicators combine context and group for message security.
  – Thread safety.
• Point-to-point communication:
  – Structured buffers and derived datatypes, heterogeneity.
  – Modes: standard, synchronous, ready (to allow access to fast protocols), buffered.
• Collective communication:
  – Both built-in & user-defined collective operators.
  – Large number of data movement routines.
  – Subgroups defined directly or by topology.

9

Is MPI Large or Small?

• MPI is large -- over 100 functions.
  – Extensive functionality requires many functions.
• MPI is small -- 6 functions:
  – MPI_Init: initialize MPI.
  – MPI_Comm_size: find out how many procs there are.
  – MPI_Comm_rank: find out which proc I am.
  – MPI_Send: send a message.
  – MPI_Recv: receive a message.
  – MPI_Finalize: terminate MPI.
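• A minimal sketch (not from the slides) exercising only these six calls; proc 0 passes one integer to proc 1 when at least two procs are running:

    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int rank, size, token = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (size > 1) {
            if (rank == 0)
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send to proc 1    */
            else if (rank == 1)
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive from 0    */
        }

        MPI_Finalize();
        return 0;
    }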

10

Is MPI Large or Small? (cont.)

• MPI is just right
  – One can access flexibility when it is required.
  – One need not master all parts of MPI to use it.

11

Send & Receive

• Cooperative data transfer:
  – To (from) whom is data sent (received)?
  – What is sent?
  – How does the receiver identify it?

[Figure: Proc 0 calls Send and Proc 1 calls Receive; the data moves from Proc 0 to Proc 1.]

12

Message Passing: Send

Syntax of MPI_Send:

MPI_Send (address, count, datatype, dest, tag, comm)

• (address, count) = a contiguous area in memory containing the message to be sent.

• datatype = type of data, e.g. integer, double precision.
  – Message length = count * sizeof(datatype)

• dest = integer identifier representing the proc to receive the message.

• tag = nonnegative integer that the destination can use to selectively screen messages.

• comm = communicator = group of procs.
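• For instance (an illustrative call, not from the slide), sending ten doubles from an array x to proc 2 with tag 0:

    double x[10];
    MPI_Send(x, 10, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);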

13

Message Passing: Receive

Syntax of MPI_Recv:

MPI_Recv (address, count, datatype, source, tag, comm, status)

• address, count, datatype, tag, comm are the same as in MPI_Send.

• source = integer identifier representing the proc sending the message.

• status = information about the message that is received.
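• A matching call on the destination proc might look as follows (illustrative; it assumes the ten doubles above were sent by proc 0, and MPI_Get_count, not covered on these slides, reads the actual length out of status):

    double x[10];
    MPI_Status status;
    int count;

    MPI_Recv(x, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);   /* number of doubles actually received */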

14

SPMD

• Proc 0 & proc 1 are actually performing different operations.
• However, it is not necessary to write separate programs for each proc.
• Typically, use a conditional statement and the proc id to identify the job of each proc. Example:

    int a[10];

    if (my_id == 0)
        MPI_Send(a, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (my_id == 1)
        MPI_Recv(a, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

15

Deadlock

• Example: exchange data between 2 procs:

    Proc 0               Proc 1
    MPI_Send (Data 1)    MPI_Send (Data 2)
    MPI_Recv (Data 2)    MPI_Recv (Data 1)

• MPI_Send is a synchronous operation. If there is no system buffering, it keeps waiting until a matching receive is posted.

16

Deadlock (cont.)

• Both procs are waiting for each other → deadlock.
• However, it is OK if system buffering exists → an unsafe program.

[Same exchange as on the previous slide: both procs call MPI_Send before MPI_Recv.]

17

Deadlock (cont.)

• Note: MPI_Recv is blocking and nonbuffered.

• A real deadlock:

    Proc 0      Proc 1
    MPI_Recv    MPI_Recv
    MPI_Send    MPI_Send

• Can be fixed by reordering the comm.:

    Proc 0      Proc 1
    MPI_Send    MPI_Recv
    MPI_Recv    MPI_Send
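• Another common fix, not shown on these slides, is MPI_Sendrecv, which lets the library schedule the paired send and receive safely (partner, n, and the buffers are illustrative names):

    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                 recvbuf, n, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, &status);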

18

Buffered / Nonbuffered Comm.

• No buffering (phone call)
  – Proc 0 initiates the send request and rings Proc 1. It waits until Proc 1 is ready to receive; then the transmission starts.
  – Synchronous comm.: completes only when the message has been received by the receiving proc.
• Buffering (beeper)
  – The message to be sent (by Proc 0) is copied to a system-controlled block of memory (buffer).
  – Proc 0 can continue executing the rest of its program.
  – When Proc 1 is ready to receive the message, the system copies the buffered message to Proc 1.
  – Asynchronous comm.: may complete even though the receiving proc has not yet received the message.

19

Buffered Comm.

• Buffering requires system resources, e.g. memory, and can be slower if the receiving proc is already ready at the time of the send request.
• Application buffer: the address space that holds the data.
• System buffer: system space for storing messages. In buffered comm., data in the application buffer is copied to/from the system buffer.
• MPI allows comm. in buffered mode: MPI_Bsend, MPI_Ibsend.
• The user allocates the buffer with MPI_Buffer_attach(buffer, buffer_size).
• Free the buffer with MPI_Buffer_detach.
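• A minimal buffered-mode sketch (illustrative; the message size and the variables x and dest are assumptions). The attached buffer must include MPI_BSEND_OVERHEAD:

    int bufsize = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD;
    char *buf = (char *) malloc(bufsize);

    MPI_Buffer_attach(buf, bufsize);                          /* hand MPI a user-allocated buffer       */
    MPI_Bsend(x, 1000, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);  /* returns once x is copied into buf      */
    MPI_Buffer_detach(&buf, &bufsize);                        /* waits until buffered messages are sent */
    free(buf);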

20

Blocking / Nonblocking Comm.

• Blocking comm. (McDonald's)
  – The receiving proc has to wait if the message is not ready.
  – Different from synchronous comm.: Proc 0 may already have buffered the message to the system and Proc 1 may be ready, but the interconnection network is busy.
• Nonblocking comm. (In & Out)
  – Proc 1 checks with the system whether the message has arrived yet. If not, it continues doing other work; otherwise, it gets the message from the system.
  – Useful when computation and comm. can be performed at the same time.

21

Blocking / Nonblocking Comm. (cont.)

• MPI allows both nonblocking send & receive: MPI_Isend, MPI_Irecv.
• A nonblocking send identifies an area in memory to serve as a send buffer. Processing continues immediately, without waiting for the message to be copied out of the application buffer.
• The program should not modify the application buffer until the nonblocking send has completed.
• Nonblocking comm. can be combined with no buffering (MPI_Issend) or with buffering (MPI_Ibsend).
• Use MPI_Wait or MPI_Test to determine whether the nonblocking send or receive has completed.
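• A hedged sketch of the "check, then do other work" pattern (do_other_work and the message parameters are illustrative placeholders):

    MPI_Request request;
    MPI_Status status;
    int arrived = 0;

    MPI_Irecv(x, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &request);
    while (!arrived) {
        MPI_Test(&request, &arrived, &status);   /* has the message arrived yet?   */
        if (!arrived)
            do_other_work();                     /* overlap computation with comm. */
    }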

22

Example: Blocking vs Nonblocking

• Data exchange in a ring topology.

[Figure: four procs P0, P1, P2, P3 connected in a ring.]

• Blocking version (neighbor ranks wrap around the ring):

    for (i = 0; i < p; i++) {
        send_offset = ((my_id - i + p) % p) * blksize;
        recv_offset = ((my_id - i - 1 + p) % p) * blksize;
        MPI_Send(y + send_offset, blksize, MPI_FLOAT, (my_id + 1) % p, 0, ring_comm);
        MPI_Recv(y + recv_offset, blksize, MPI_FLOAT, (my_id - 1 + p) % p, 0, ring_comm, &status);
    }

23

Example: Blocking vs Nonblocking (cont.)

• Nonblocking version:

    send_offset = my_id * blksize;
    recv_offset = ((my_id - 1 + p) % p) * blksize;
    for (i = 0; i < p - 1; i++) {
        MPI_Isend(y + send_offset, blksize, MPI_FLOAT, (my_id + 1) % p, 0, ring_comm, &send_request);
        MPI_Irecv(y + recv_offset, blksize, MPI_FLOAT, (my_id - 1 + p) % p, 0, ring_comm, &recv_request);
        send_offset = ((my_id - i - 1 + p) % p) * blksize;   /* prepare offsets for the next step     */
        recv_offset = ((my_id - i - 2 + p) % p) * blksize;   /* while the messages are still in flight */
        MPI_Wait(&send_request, &status);
        MPI_Wait(&recv_request, &status);
    }

24

Summary: Comm. Modes

• 4 comm. modes in MPI: standard, buffered, synchronous, ready. Each can be either blocking or nonblocking.

• In standard mode (MPI_Send, MPI_Recv), it is up to the system to decide whether messages should be buffered.

• In synchronous mode, a send won't complete until a matching receive has been posted and has begun receiving the data.
  – MPI_Ssend, MPI_Issend.
  – No system buffering.

25

Summary: Comm. Modes (cont.)

• In buffered mode, the completion of a send does not depend on the existence of a matching receive.
  – MPI_Bsend, MPI_Ibsend.
  – System buffering via MPI_Buffer_attach & MPI_Buffer_detach.

• Ready mode is not discussed here.

26

Collective Communications

• Comm. patterns involving all the procs; usually more than 2.
• MPI_Barrier: synchronize all procs.
• Broadcast (MPI_Bcast).
• Reduction (MPI_Reduce)
  – All procs contribute data that is combined using a binary op.
  – E.g. max, min, sum, etc.
  – One proc obtains the final answer.
• Allreduce (MPI_Allreduce)
  – Same as MPI_Reduce, but every proc obtains the final answer.
  – Conceptually, MPI_Allreduce = MPI_Reduce + MPI_Bcast, but more efficient.
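• Illustrative calls for each of these collectives (the local_/global_ variables are assumptions):

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);                                    /* proc 0's n goes to all procs */
    MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);  /* only proc 0 gets the max     */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);  /* every proc gets the sum      */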

27

An Implementation

• Tree-structured comm. (find the max among procs):
• Only needs log p stages of comm.
• Not necessarily optimal on a particular architecture.

[Figure: binary-tree reduction over procs P0-P7; at each of the log p = 3 stages, pairs of procs combine their values until one proc holds the maximum, 7.]

28

Example 1: Hello, world!

• #include "mpi.h": basic MPI definitions and data types.
• MPI_Init starts MPI.
• MPI_Finalize exits MPI.
• Note: all non-MPI routines are local; thus printf runs on each proc.

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        printf("Hello, world!\n");
        MPI_Finalize();
        return 0;
    }

29

Example 2: "Advanced" Hello, world!

• MPI_Comm_rank returns the proc id.
• MPI_Comm_size returns the # of procs.
• Note: on some parallel systems, only a few designated procs can do I/O.
• What does the output look like?

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello, world! I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

30

Example: Calculate π

• Well-known formula:

    \int_0^1 \frac{4}{1+x^2} \, dx = \pi

• Numerical integration (trapezoidal rule):

    \int_a^b f(x) \, dx \approx h \left[ \tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \tfrac{1}{2} f(x_n) \right],

  where x_i = a + ih, h = (b - a)/n, and n = # of subintervals.

[Figure: f(x) over [a, b], partitioned at x_0 = a, x_1, x_2, ..., x_{n-1}, x_n = b.]

31

Example: Calculate π (cont.)

• A sequential function Trap(a, b, n) approximates the integral of f(x) from a to b using the trapezoidal rule with n subintervals:

    double Trap(double a, double b, int n)    /* f(x) = 4/(1+x*x) is assumed to be defined elsewhere */
    {
        double integral, x, h;
        int i;

        h = (b - a) / n;
        integral = h * (f(a) + f(b)) / 2;
        for (i = 1; i < n; i++) {
            x = a + i * h;
            integral = integral + h * f(x);
        }
        return integral;
    }

32

Parallelizing Trap

• Divide the interval [a,b] into p equal subintervals.
• Each proc calculates its local approximate integral using the trap rule, simultaneously.
• Finally, combine the local values to obtain the total integral.

[Figure: f(x) over [a, b] with x_0 = a and x_n = b; Proc k handles the subinterval from x_{kn/p} to x_{(k+1)n/p}.]

33

Parallel Trap Program

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out how many procs */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Determine my proc id */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    /* Apply Trap rule locally */
    n = 128;  a = 0;  b = 1;  h = (b - a) / n;
    n_k = n / p;                  /* subintervals handled by this proc */
    a_k = a + my_id * n_k * h;    /* left endpoint for this proc       */
    b_k = a_k + n_k * h;          /* right endpoint for this proc      */
    integral = Trap(a_k, b_k, n_k);

34

Parallel Trap Program (cont.)

    /* Sum up the integrals */
    if (my_id == 0) {
        total = integral;
        for (k = 1; k < p; k++) {
            MPI_Recv(&integral, 1, MPI_DOUBLE, k, tag, MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    }

    /* Close MPI */
    MPI_Finalize();

35

Parallelizing Trap Program (cont.)

• Can/should replace the MPI_Send & MPI_Recv loop by MPI_Reduce.

• Embarrassingly parallel -- no comm. is needed during the computation of the local approximate integrals.
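• A sketch of that replacement (one collective call instead of the send/receive loop on the previous slide):

    /* every proc contributes its local integral; proc 0 receives the sum */
    MPI_Reduce(&integral, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);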

36

Wildcards

• MPI_ANY_TAG, MPI_ANY_SOURCE
  – MPI_Recv can use them for the tag and source input arguments.

• The status output argument may then be used to determine the actual source and tag.

• In C, the last parameter of MPI_Recv, status, is a struct with at least 2 members:
  – status -> MPI_TAG
  – status -> MPI_SOURCE

• They return the rank of the proc that sent the message (MPI_SOURCE) and the tag number (MPI_TAG).
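• An illustrative receive using the wildcards (the variable value is an assumption; status is accessed here as a plain struct):

    MPI_Status status;
    double value;

    MPI_Recv(&value, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    printf("got a message from proc %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);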

37

Timing

• MPI_Wtime() returns the wall-clock time.

    double start, finish, time;

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    ...                              /* code being timed */
    MPI_Barrier(MPI_COMM_WORLD);
    finish = MPI_Wtime();
    time = finish - start;

38

MPI Data Structures

• Suppose that in the previous program, proc 0 reads the values of a, b, & n from standard input and then broadcasts them to the other procs.

• Consequently, we would need to perform MPI_Bcast 3 times.

• Sending a message is expensive in a parallel environment → minimize latency.

• We can reduce the overhead by sending the 3 values in a single message.

• 3 approaches: count, derived datatype, MPI_Pack/Unpack.

39

(I) count + datatype

• In MPI_Send (MPI_Recv, MPI_Bcast, …), we specify the length of the data by count.

• Thus, we may group data items having the same datatype by storing them in contiguous memory locations (e.g. an array); see the sketch below.

• Unfortunately, in our case, a & b are doubles but n is an integer.
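• A sketch of the count approach for the two doubles (the array name ab is an assumption); because n is an integer, it still needs its own broadcast:

    double ab[2];

    ab[0] = a;  ab[1] = b;
    MPI_Bcast(ab, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* one message for a and b  */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);      /* n must travel separately */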

40

(II) Derived Data Type

• Define an MPI datatype consisting of 2 doubles and 1 integer.
• Use MPI_Type_struct.
• A general MPI datatype is a sequence of pairs:

    {(t_0, d_0), (t_1, d_1), ..., (t_n, d_n)}

  – t_i = an MPI datatype
  – d_i = displacement in bytes relative to the starting address of the message
• E.g. if a, b, n are stored at addresses 10, 25, 30, then the derived datatype is:

    {(MPI_DOUBLE, 0), (MPI_DOUBLE, 15), (MPI_INT, 20)}

41

(II) Derived Data Type (cont.)

    double a, b;
    int n, blklen[3];
    MPI_Datatype newtype, type[3];
    MPI_Aint disp[3], base, address;

    n = 128;  a = 0;  b = 1;
    blklen[0] = blklen[1] = blklen[2] = 1;
    type[0] = type[1] = MPI_DOUBLE;
    type[2] = MPI_INT;

    disp[0] = 0;
    MPI_Address(&a, &base);
    MPI_Address(&b, &address);
    disp[1] = address - base;
    MPI_Address(&n, &address);
    disp[2] = address - base;

    MPI_Type_struct(3, blklen, disp, type, &newtype);
    MPI_Type_commit(&newtype);      /* the type must be committed before use (not shown on the original slide) */
    MPI_Bcast(&a, 1, newtype, 0, MPI_COMM_WORLD);

42

(III) MPI_Pack / Unpack

• MPI_Pack stores noncontiguous data in contiguous memory locations.

• MPI_Unpack copies data from a contiguous buffer into noncontiguous memory locations (the reverse of MPI_Pack).

    char buffer[buffer_size];
    int position;

    position = 0;
    MPI_Pack(&a, 1, MPI_DOUBLE, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Pack(&b, 1, MPI_DOUBLE, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Pack(&n, 1, MPI_INT, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Bcast(buffer, buffer_size, MPI_PACKED, 0, MPI_COMM_WORLD);
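• On the receiving procs, the values are unpacked in the same order (a hedged sketch mirroring the pack calls above):

    position = 0;
    MPI_Bcast(buffer, buffer_size, MPI_PACKED, 0, MPI_COMM_WORLD);
    MPI_Unpack(buffer, buffer_size, &position, &a, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer, buffer_size, &position, &b, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer, buffer_size, &position, &n, 1, MPI_INT, MPI_COMM_WORLD);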

43

MPI Topology

• Cartesian meshes are very common in parallel programs solving PDEs.
• In such programs, the comm. patterns closely resemble the computational meshes.
• The mapping of the comm. topology to the hardware topology can be made in many ways; some are better than others.
• Thus, MPI allows the vendor to help optimize this mapping.
• Two types of virtual topologies in MPI: Cartesian & graph topology.

44

Cartesian Topology

• Create a 4×3 2D-mesh topology:
• MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, new_comm);
  – ndim = # of dimensions = 2
  – dims = # of procs in each direction: dims[0] = 4, dims[1] = 3

[Figure: the 4×3 mesh of procs, labelled by coordinates (0,0) through (3,2).]

45

Cartesian Topology (cont.)

• MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, new_comm);
  – periods indicates whether the procs at the ends are connected; useful for periodic domains.
  – periods[0] = periods[1] = 0;
  – reorder indicates whether the system is allowed to optimize the mapping of grid procs to the underlying physical procs by reordering.
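• Putting the pieces together (a sketch; the dims and periods values come from the two slides above, while reorder = 1 is an assumption):

    int ndim = 2;
    int dims[2] = {4, 3};        /* 4 procs in one direction, 3 in the other */
    int periods[2] = {0, 0};     /* no wrap-around in either direction       */
    int reorder = 1;             /* let the system reorder ranks if it helps */
    MPI_Comm new_comm;

    MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, &new_comm);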

46

Other Functions

• MPI_Comm_rank(new_comm, my_new_rank)
  – The ranking may have been changed by reordering.
• MPI_Cart_coords(new_comm, my_new_rank, ndim, coordinates);
  – Given the rank, it returns the array coordinates containing the coordinates of the proc.
• MPI_Cart_rank returns the rank when given the coordinates.
• MPI_Cart_get returns dims, periods, coords.
• MPI_Cart_shift returns the ranks of the source & dest procs in a shift operation.
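• An illustrative sequence of these calls on the mesh created above (new_comm as before):

    int my_new_rank, coords[2], source, dest;

    MPI_Comm_rank(new_comm, &my_new_rank);
    MPI_Cart_coords(new_comm, my_new_rank, 2, coords);   /* my (row, column) in the mesh         */
    MPI_Cart_shift(new_comm, 0, 1, &source, &dest);      /* neighbor ranks one step along dim. 0 */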

47

Communicators

• Example: MPI_COMM_WORLD.
• Communicator = group + context.
• A communicator consists of:
  – A group of procs.
  – A set of comm. channels between these procs.
  – Each communicator has its own set of channels.
  – Messages sent with one communicator cannot be received by another.
• Enables the development of safe software libraries.
  – A library uses a private comm. domain.
• Sometimes, restricting comm. to a subgroup is useful, e.g. broadcasting messages across a row or down a column of grid procs; see the sketch below.
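• One way to obtain such a row communicator (a sketch using MPI_Comm_split, which is not introduced on these slides; coords comes from MPI_Cart_coords above, and x and count are assumptions):

    MPI_Comm row_comm;

    /* procs with the same row index coords[0] end up in the same communicator */
    MPI_Comm_split(new_comm, coords[0], coords[1], &row_comm);
    MPI_Bcast(x, count, MPI_DOUBLE, 0, row_comm);   /* broadcast along my row only */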

48

Why Communicators?

• Conflicts among MPI calls by users and libraries.
• E.g. Sub1 & Sub2 are from 2 different libraries.
• Correct execution of library calls:

[Figure: timeline of send/recv calls on procs P0, P1, P2 issued by Sub1 and Sub2; in this ordering, each recv(any) posted by a library is matched by a send from the same library.]

49

Why Communicators? (cont.)

• Incorrect execution of library calls:

[Figure: the same timeline of send/recv calls on procs P0, P1, P2, but here a recv(any) posted by one library is matched by a send intended for the other, mixing up the two libraries' messages.]

50

MPI Group

• A group = a set of procs.
• Create a group by:
  – MPI_Group_incl: include specific members.
  – MPI_Group_excl: exclude specific members.
  – MPI_Group_union: union of two groups.
  – MPI_Group_intersection: intersection of two groups.
• MPI_Comm_group: get an existing group.
• MPI_Group_free: free a group.
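• A sketch that builds a subgroup and a communicator for it (the even ranks are an arbitrary example; MPI_Comm_create is not listed on the slide):

    MPI_Group world_group, even_group;
    MPI_Comm even_comm;
    int ranks[4] = {0, 2, 4, 6};    /* hypothetical members of the new group */

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 4, ranks, &even_group);
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);   /* communicator over the subgroup */
    MPI_Group_free(&even_group);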

51

MPI-2

• Extensions to MPI 1.1 and MPI 1.2.

• Major topics being discussed:

  – Dynamic process management.

– Client/server.

– Real-time extensions.

– "One-sided" communications.

– Portable access to MPI state (for debuggers).

– Language bindings for C++ and Fortran 90.
