
Distributed Processing Systems

(InterProcess Communication) (Message Passing)

오 상 규

Sogang University Graduate School of Information and Communications

Email : [email protected]

Page 2


What is Message Passing?

Data transfer + Synchronization

[Figure: two process timelines. Process 0 asks Process 1 "May I send?"; Process 1 answers "Yes!"; DATA packets then flow from Process 0 to Process 1 over TIME.]

Requires cooperation of sender & receiver.

Page 3


Characteristics of Message Passing

- Multiple Threads of Control
  Consists of multiple processes, each of which has its own flow of control and may execute different code. Supports MIMD or SPMD parallelism.
- Asynchronous Parallelism
  A message-passing program executes asynchronously; barriers and blocking communication are needed for synchronization.
- Separate Address Space
  Data variables in one process are not visible to other processes; special library routines (e.g., send/receive) are needed to interact with other processes.
- Explicit Interactions
  The programmer must resolve all interaction issues, such as communication and synchronization.
- Explicit Allocation
  Data must be explicitly allocated by the user.

Page 4


Message Passing Libraries

Proprietary Software
- CMMD : Message passing library used in Thinking Machines CM-5.
- Express : Programming environment by Parasoft Corporation for message passing and parallel I/O.
- NX : Microkernel system developed for Intel MPPs (e.g., Paragon). Replaced by a new kernel called PUMA.

Public-Domain Software
- p4 : A set of macros and subroutines used for programming both shared-memory and message-passing systems.
- PARMACS : Message passing package derived from p4 and mainly used in Europe.

PVM and MPI
- MPI : A standard specification for a library of message passing functions developed by the MPI Forum.
- PVM : Self-contained, public-domain software system to run parallel applications on a network of heterogeneous workstations.

Page 5


Classification of Message Passing Libraries

Application Domain
- General Purpose : p4, PVM, MPI, Express, PARMACS, etc.; ISIS, Horus, Totem, and Transis for reliable group communication.
- Application Specific : BLACS (for linear algebra), TCGMSG (for chemistry), etc.

Programming Model
- Computation Model : data parallel or functional parallel.
- Communication Model : RPC, message passing, or shared memory.

Underlying Implementation Philosophy
- Sockets for portability.
- High-performance communication middleware (e.g., Active Messages or Fast Messages) to achieve high performance.

Portability
- CMMD for CM-5 and NX/2 for Intel parallel computers (machine-specific).

Heterogeneity
- e.g., PVM runs on networks of heterogeneous workstations.

Page 6


High-Performance Message-Passing Schemes

- HW-Based Approach (e.g., Nectar, PAPER, SHRIMP, ParaStation, Memory Channel)
- SW-Based Approach
  - High-Performance API
    - Standard (e.g., Fast Sockets)
    - Proprietary (e.g., U-Net, Active Messages (AM), Fast Messages (FM))
  - Multithreading (e.g., TPVM, LPVM, Chant)
  - Hybrid Approach (e.g., MPI-FM, PVM-ATM)
  - Middleware (e.g., MPI-Nexus, Panda-PVM)

Page 7


Communication Modes in Message Passing

- Synchronous Message Passing
- Blocking Send/Receive
- Non-Blocking Send/Receive

Running example:

    Process P                 Process Q

        M = 10;                   S = -100;
    L1: send M to Q;          L1: receive S from P;
    L2: M = 20;               L2: X = S + 1;
        goto L1;

Variable M is often called the send message buffer, and S is called the receive message buffer.

Page 8


Three Communication Modes

- Synchronous Message Passing
  P has to wait until Q executes the corresponding Receive. Send will not return until M is both sent and received, so no additional buffer is needed. X evaluates to 11.
- Blocking Send/Receive
  Send is executed when a process reaches it, without waiting for a corresponding Receive, but does not return until the message is sent, meaning that the message variable M can then be safely rewritten. The message may be temporarily buffered in the sending node, somewhere in the network, or in the receiving node. Receive is executed when a process reaches it, without waiting for a corresponding Send, and does not return until the message is received. X evaluates to 11.
- Non-Blocking Send/Receive
  Send is executed when a process reaches it, without waiting for a corresponding Receive, and returns immediately after it notifies the system; it is unsafe to overwrite M until the send completes. Receive is executed when a process reaches it, without waiting for a corresponding Send, and returns immediately regardless of message arrival. X can be 11, 21, or -99.
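MPI, covered in the rest of this lecture, exposes all three modes directly. Below is a minimal sketch assuming a run with exactly two processes; the buffer contents and tags are illustrative only.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, buf = 7 ;
    MPI_Status st ;
    MPI_Request req ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    if ( rank == 0 ) {
        /* synchronous : returns only after rank 1 has started the matching receive */
        MPI_Ssend ( &buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD ) ;
        /* blocking : returns as soon as buf may be safely rewritten (message possibly buffered) */
        MPI_Send ( &buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD ) ;
        /* non-blocking : returns immediately; buf is unsafe to overwrite ... */
        MPI_Isend ( &buf, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req ) ;
        MPI_Wait ( &req, &st ) ;   /* ... until the request completes here */
    }
    else if ( rank == 1 ) {
        MPI_Recv ( &buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st ) ;
        MPI_Recv ( &buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st ) ;
        MPI_Recv ( &buf, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &st ) ;
    }
    MPI_Finalize ( ) ;
    return 0 ;
}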

Page 9


Comparison of Communication Modes

Communication Event                           Synchronous                     Blocking        Non-Blocking
Send Start Condition                          Both send and receive reached   Send reached    Send reached
Return of Send Indicates                      Message received                Message sent    Message send initiated
Semantics                                     Clean                           In-Between      Error-Prone
Buffering Message                             Not needed                      Needed          Needed
Status Checking                               Not needed                      Not needed      Needed
Wait Overhead                                 Highest                         In-Between      Lowest
Overlapping Communications and Computations   No                              Yes             Yes

Page 10


What is MPI?

MPI : Message Passing Interface
- Developed in 1993-1994 by the MPI Forum.
- A message-passing library specification: can be used from C, FORTRAN, and C++ programs, and comprises 129 functions and macros.
- Not a language or compiler specification.
- Not a specific implementation or product.
- A standard for programming parallel computers, clusters, and heterogeneous networks.

Page 11


Reasons for using MPI

- Standardization : The only message passing library which can be considered a standard.
- Portability : No need to modify your source code when you port your application to a different platform.
- Performance : Vendor implementations should be able to exploit native hardware features to optimize performance.
- Availability : A variety of implementations are available.

Page 12


Communicator

A subset of processes forming a "communication universe," composed of:
- Group : an ordered collection of processes.
- Context : a system-defined tag that is attached to the group.

[Figure: a communicator containing PROCESS 0, PROCESS 1, ..., PROCESS n.]

Each process is assigned a unique rank (a non-negative integer process I.D.).

Communicators serve two purposes:
- identifying process subsets during development of modular programs.
- ensuring that messages intended for different purposes are not confused.

Page 13


Types of Communicators

Intra-Communicators
- A collection of processes that can send messages to each other and engage in collective communication operations.
- ex) MPI_COMM_WORLD (default)

Inter-Communicators
- Used for sending messages between processes belonging to disjoint intra-communicators.
- ex) a newly created intra-communicator can be linked to the original intra-communicator by an inter-communicator.
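As a sketch of how a new intra-communicator is created in practice, MPI_Comm_split (a standard MPI-1 call) partitions MPI_COMM_WORLD into disjoint subgroups; the even/odd split below is purely illustrative.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, subrank ;
    MPI_Comm subcomm ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    /* processes passing the same color land in the same new intra-communicator */
    MPI_Comm_split ( MPI_COMM_WORLD, rank % 2, rank, &subcomm ) ;
    MPI_Comm_rank ( subcomm, &subrank ) ;   /* fresh rank within the subgroup */
    MPI_Comm_free ( &subcomm ) ;
    MPI_Finalize ( ) ;
    return 0 ;
}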

Page 14


MPI Communication Model

Point-to-point communication operations
- Send a message from one named process to another.
- Used to implement local and unstructured communications.

Collective communication operations
- Perform commonly used global operations such as summation and broadcast.

Page 15


MPI Data Types

MPI DATA TYPE         C DATA TYPE
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              (no C equivalent)
MPI_PACKED            (no C equivalent)

Page 16


MPI Basic Functions

MPI_INIT(int *argc, char ***argv) : initiate an MPI computation.

MPI_FINALIZE() : terminate a computation.

MPI_COMM_SIZE(IN comm, OUT size) : determine the number of processes.
- MPI_Comm comm : communicator handle
- int size : number of processes in the group of comm

MPI_COMM_RANK(IN comm, OUT pid) : determine my process identifier.
- MPI_Comm comm : communicator handle
- int pid : process id in the group of comm

Cf. IN : call by value; OUT : returned as a result; INOUT : call by reference.

Page 17


Simple MPI Example

#include "mpi.h"                                   /* MPI header */

main ( int argc, char *argv[] )                    /* main routine */
{
    . . .                                          /* No MPI functions called before this */
    ierr = MPI_Init ( &argc, &argv ) ;             /* initialize */
    . . .
    MPI_Comm_size ( MPI_COMM_WORLD, &np ) ;        /* number of processes */
    MPI_Comm_rank ( MPI_COMM_WORLD, &myid ) ;      /* my process id */
    . . .
    if ( myid != 0 )
        MPI_Send ( buff, 300, MPI_FLOAT, 0, 0, MPI_COMM_WORLD ) ;
    else
        MPI_Recv ( buff, 300, MPI_FLOAT, srcid, 0, MPI_COMM_WORLD, &status ) ;
    . . .
    MPI_Finalize ( ) ;                             /* shutdown; no MPI functions called after this */
}

Page 18


MPI Message

MPI Message = Data + Envelope

Envelope
- Sender rank
- Receiver rank
- A tag (user specified) : used to distinguish messages received from a single process.
- A communicator

Data
- Mechanisms for grouping data items:
  - count parameter
  - derived datatypes
  - MPI_Pack / MPI_Unpack
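A minimal sketch of the MPI_Pack / MPI_Unpack grouping mechanism, assuming a run with at least two processes; the buffer size and values are illustrative.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, n = 5, pos = 0 ;
    double x = 3.14 ;
    char buf[64] ;
    MPI_Status st ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    if ( rank == 0 ) {
        /* pack an int and a double into one contiguous buffer, send as MPI_PACKED */
        MPI_Pack ( &n, 1, MPI_INT, buf, 64, &pos, MPI_COMM_WORLD ) ;
        MPI_Pack ( &x, 1, MPI_DOUBLE, buf, 64, &pos, MPI_COMM_WORLD ) ;
        MPI_Send ( buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD ) ;
    }
    else if ( rank == 1 ) {
        MPI_Recv ( buf, 64, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &st ) ;
        /* unpack in the same order the items were packed */
        MPI_Unpack ( buf, 64, &pos, &n, 1, MPI_INT, MPI_COMM_WORLD ) ;
        MPI_Unpack ( buf, 64, &pos, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD ) ;
    }
    MPI_Finalize ( ) ;
    return 0 ;
}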

Page 19


MPI Point to Point Communication

[Figure: within a COMMUNICATOR, PROCESS A calls SEND( ) and PROCESS B calls RECV( ); the transfer uses either blocking or non-blocking communication.]

Page 20


MPI Send / Receive Function Prototypes

MPI_SEND(IN msg, IN count, IN datatype, IN dest, IN tag, IN comm) : send a message.
- void *msg : address of send buffer
- int count : number of elements to send (>= 0)
- MPI_Datatype datatype : data type of send buffer elements
- int dest : process id of destination process
- int tag : message tag
- MPI_Comm comm : communicator handle

MPI_RECV(OUT msg, IN count, IN datatype, IN source, IN tag, IN comm, OUT status) : receive a message.
- void *msg : address of receive buffer
- int count : number of elements to receive (>= 0)
- MPI_Datatype datatype : data type of receive buffer elements
- int source : process id of source process, or MPI_ANY_SOURCE
- int tag : message tag, or MPI_ANY_TAG
- MPI_Comm comm : communicator handle
- MPI_Status *status : status object
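A short sketch of the wildcard arguments and the status object; the payload sizes and tags are illustrative. Rank 0 accepts messages from any sender in arrival order and inspects the status to learn who sent what.

#include "mpi.h"
#include <stdio.h>

int main ( int argc, char *argv[] )
{
    int i, rank, numtasks, cnt, data[100] = { 0 } ;
    MPI_Status st ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &numtasks ) ;
    if ( rank == 0 ) {
        for ( i = 1 ; i < numtasks ; i++ ) {
            MPI_Recv ( data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &st ) ;
            MPI_Get_count ( &st, MPI_INT, &cnt ) ;   /* actual element count received */
            printf ( "got %d ints from rank %d with tag %d \n",
                     cnt, st.MPI_SOURCE, st.MPI_TAG ) ;
        }
    }
    else   /* each worker sends 'rank' elements, tagged with its own rank */
        MPI_Send ( data, rank, MPI_INT, 0, rank, MPI_COMM_WORLD ) ;
    MPI_Finalize ( ) ;
    return 0 ;
}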

Page 21


Blocking Message Passing Example

#include "mpi.h"
#include <stdio.h>

main ( int argc, char *argv[] )
{
    int numtasks, rank, dest, source, rc, tag = 1 ;
    char inmsg, outmsg = 'x' ;
    MPI_Status Stat ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &numtasks ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;

    if ( rank == 0 ) {
        dest = 1 ; source = 1 ;
        rc = MPI_Send ( &outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD ) ;
        rc = MPI_Recv ( &inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat ) ;
    }
    else if ( rank == 1 ) {
        dest = 0 ; source = 0 ;
        rc = MPI_Recv ( &inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat ) ;
        rc = MPI_Send ( &outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD ) ;
    }
    MPI_Finalize ( ) ;
}

Page 22


Non-Blocking Message Passing Example

#include "mpi.h"
#include <stdio.h>

main ( int argc, char *argv[] )
{
    int numtasks, rank, next, prev, buf[2], tag1 = 1, tag2 = 2 ;
    MPI_Request reqs[4] ;
    MPI_Status stats[4] ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &numtasks ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;

    prev = rank - 1 ;
    next = rank + 1 ;
    if ( rank == 0 ) prev = numtasks - 1 ;
    if ( rank == ( numtasks - 1 ) ) next = 0 ;

    MPI_Irecv ( &buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0] ) ;
    MPI_Irecv ( &buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1] ) ;

    MPI_Isend ( &rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2] ) ;
    MPI_Isend ( &rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3] ) ;

    MPI_Waitall ( 4, reqs, stats ) ;

    MPI_Finalize ( ) ;
}
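Note the design of this example: every process posts its two non-blocking receives before its sends, and MPI_Waitall blocks until all four requests complete. Because MPI_Isend and MPI_Irecv return immediately, the ring exchange cannot deadlock even without system buffering, whereas the same ring written with blocking sends could.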

Page 23


MPI Collective Communication

A communication pattern that involves all the processes in a communicator.

Tree-structured communication

[Figure: broadcast tree rooted at P0. Stage 1: P0 → P4. Stage 2: P0 → P2, P4 → P6. Stage 3: P0 → P1, P2 → P3, P4 → P5, P6 → P7.]

If we have p processes, this procedure distributes the input data in log2(p) stages, rather than p - 1 stages, which, if p is large, is a huge saving.
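A minimal sketch of such a tree-structured broadcast built from point-to-point calls, assuming rank 0 holds the input; the tree shape differs slightly from the figure but uses the same number of stages.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, p, mask, data = 0 ;
    MPI_Status st ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &p ) ;
    if ( rank == 0 ) data = 42 ;   /* the value to distribute */
    /* in each stage the set of holders doubles, so log2(p) stages suffice */
    for ( mask = 1 ; mask < p ; mask *= 2 ) {
        if ( rank < mask && rank + mask < p )
            MPI_Send ( &data, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD ) ;
        else if ( rank >= mask && rank < 2 * mask )
            MPI_Recv ( &data, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD, &st ) ;
    }
    MPI_Finalize ( ) ;
    return 0 ;
}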

Page 24


Barrier and Broadcast Operations

MPI_BARRIER(IN comm) : Synchronizes all processes.

MPI_BCAST(INOUT inbuf, IN incnt, IN intype, IN root, IN comm) : Sends data from one process to all processes.

[Figure: before MPI_BCAST only the root process holds data item A0; afterwards every process holds A0.]
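A minimal MPI_Bcast sketch matching the figure; the value and choice of root are illustrative.

#include "mpi.h"
#include <stdio.h>

int main ( int argc, char *argv[] )
{
    int rank, a0 = 0 ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    if ( rank == 0 ) a0 = 100 ;        /* only the root holds A0 initially */
    MPI_Barrier ( MPI_COMM_WORLD ) ;   /* synchronize all processes (not required for the broadcast) */
    MPI_Bcast ( &a0, 1, MPI_INT, 0, MPI_COMM_WORLD ) ;
    printf ( "rank %d now has %d \n", rank, a0 ) ;   /* every rank prints 100 */
    MPI_Finalize ( ) ;
    return 0 ;
}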

Page 25


Gather and Scatter Operations

MPI_GATHER(IN inbuf, IN incnt, IN intype, OUT outbuf, IN outcnt, IN outtype, IN root, IN comm) : Gathers data from all processes to one process.

[Figure: before MPI_GATHER each process i holds its own item Ai; afterwards the root holds A0, A1, A2, ... in rank order.]

MPI_SCATTER(IN inbuf, IN incnt, IN intype, OUT outbuf, IN outcnt, IN outtype, IN root, IN comm) : Scatters data from one process to all processes.

[Figure: before MPI_SCATTER the root holds A0, A1, A2, ...; afterwards process i holds Ai.]
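Since the example on the next page exercises only MPI_SCATTER, here is a complementary MPI_Gather sketch; each rank's contribution is illustrative.

#include "mpi.h"
#include <stdlib.h>

int main ( int argc, char *argv[] )
{
    int rank, numtasks, mine ;
    int *all = NULL ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &numtasks ) ;
    mine = rank * rank ;   /* each process contributes one value */
    if ( rank == 0 )       /* only the root needs the receive buffer */
        all = (int *) malloc ( numtasks * sizeof(int) ) ;
    /* afterwards all[i] on the root holds the contribution of rank i */
    MPI_Gather ( &mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD ) ;
    if ( rank == 0 ) free ( all ) ;
    MPI_Finalize ( ) ;
    return 0 ;
}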

Page 26


Reduce Operation (1)

MPI_REDUCE(IN inbuf, OUT outbuf, IN cnt, IN type, IN op, IN root, IN comm) : combines the values from all processes into the output buffer of the single root process using the specified operation.

                 PROCESS 0   PROCESS 1   PROCESS 2   PROCESS 3
Initial Data :     2  4        5  7        0  3        6  2
Result :           -  -       13 16        -  -        -  -

MPI_REDUCE with MPI_SUM, root = 1
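A sketch reproducing the diagram above; it assumes a run with exactly 4 processes.

#include "mpi.h"
#include <stdio.h>

int main ( int argc, char *argv[] )
{
    /* per-rank initial data matching the diagram */
    int init[4][2] = { { 2, 4 } , { 5, 7 } , { 0, 3 } , { 6, 2 } } ;
    int rank, out[2] ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    /* element-wise sum over all ranks, delivered only to root = 1 */
    MPI_Reduce ( init[rank], out, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD ) ;
    if ( rank == 1 )
        printf ( "sums : %d %d \n", out[0], out[1] ) ;   /* prints 13 16 */
    MPI_Finalize ( ) ;
    return 0 ;
}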

Page 27


Reduce Operation (2)

MPI_ALLREDUCE(IN inbuf, OUT outbuf, IN cnt, IN type, IN op, IN comm) : combines the values into the output buffers of all processes using the specified operation.

                 PROCESS 0   PROCESS 1   PROCESS 2   PROCESS 3
Initial Data :     2  4        5  7        0  3        6  2
Result :           0  2        0  2        0  2        0  2

MPI_ALLREDUCE with MPI_MIN

Page 28


Collective Communication Example

#include "mpi.h"
#include <stdio.h>
#define SIZE 4

main ( int argc, char *argv[] )
{
    int numtasks, rank, sendcount, recvcount, source ;
    float sbuf[SIZE][SIZE] = { {  1.0 ,  2.0 ,  3.0 ,  4.0 } ,
                               {  5.0 ,  6.0 ,  7.0 ,  8.0 } ,
                               {  9.0 , 10.0 , 11.0 , 12.0 } ,
                               { 13.0 , 14.0 , 15.0 , 16.0 } } ;
    float rbuf[SIZE] ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    MPI_Comm_size ( MPI_COMM_WORLD, &numtasks ) ;

    if ( numtasks == SIZE ) {
        source = 1 ; sendcount = SIZE ; recvcount = SIZE ;
        MPI_Scatter ( sbuf, sendcount, MPI_FLOAT, rbuf, recvcount, MPI_FLOAT,
                      source, MPI_COMM_WORLD ) ;
        printf ( "rank=%d results : %f %f %f %f \n",
                 rank, rbuf[0], rbuf[1], rbuf[2], rbuf[3] ) ;
    }
    else
        printf ( "Must specify %d processors. Terminating. \n", SIZE ) ;

    MPI_Finalize ( ) ;
}

Page 29


MPI Implementation (1)

MPICH
- Freely available implementation of the MPI standard, designed to be both portable and efficient.
- Developed at Argonne National Laboratory and Mississippi State University.

To compile the C source program prog.c:

% cc -o prog prog.c -I/usr/local/mpi/include -L/usr/local/mpi/lib -lmpi

To run the program with 4 processes:

% mpirun -np 4 prog
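Most MPICH installations also provide a compiler wrapper that supplies the include and library flags shown above automatically; assuming a standard installation:

% mpicc -o prog prog.c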

Page 30


MPI Implementation (2)

LAM
- Available from the Ohio Supercomputer Center; runs on heterogeneous networks of Sun, DEC, SGI, and HP workstations.

CHIMP-MPI
- Available from the Edinburgh Parallel Computing Centre; runs on Sun, DEC, SGI, IBM, and HP workstations, the Meiko Computing Surface machines, and the Fujitsu AP-1000.

Page 31


MPI Interaction Architecture

MPICH layering:
- MPI (Message Passing Interface) : machine-independent layer.
- ADI (Abstract Device Interface) : machine-specific layer.
  - provides efficient communication primitives.
  - optimizations apply to the ADI and to the higher layers of MPICH.
- Underlying high-performance transports : AM (Active Messages), FM (Fast Messages), etc.

Page 32


MPI 2 (1)

Enhanced MPI
- Discussed in 1995 by the MPI Forum; a draft was made available in 1996.

New functionality
- Dynamic processes : extensions which remove the static process model of MPI (e.g., MPI_SPAWN).
- One-sided communications : include shared-memory operations (put/get) and remote accumulate operations (e.g., MPI_PUT).
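A minimal sketch of MPI-2 one-sided communication, assuming a run with at least two processes; the window layout and the value written are illustrative.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank, local = 0, v = 99 ;
    MPI_Win win ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    /* expose one int per process as a remotely accessible window */
    MPI_Win_create ( &local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win ) ;
    MPI_Win_fence ( 0, win ) ;   /* open an access epoch */
    if ( rank == 0 )
        /* write v into rank 1's window; rank 1 never calls a receive */
        MPI_Put ( &v, 1, MPI_INT, 1, 0, 1, MPI_INT, win ) ;
    MPI_Win_fence ( 0, win ) ;   /* close the epoch; the put is now visible in 'local' on rank 1 */
    MPI_Win_free ( &win ) ;
    MPI_Finalize ( ) ;
    return 0 ;
}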

Page 33


MPI 2 (2)

Parallel I/O : MPI support for parallel I/O (MPI-IO). I/O can itself be modeled as message passing:
- Writing to a file : sending a message.
- Reading from a file : receiving a message.

Extended collective operations : allow non-blocking collective operations and the application of collective operations to inter-communicators.

External interfaces : define routines which allow developers to layer tools, such as debuggers and profilers, on top of MPI.

Additional language bindings : describe C++ bindings and discuss FORTRAN-90 issues.
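A minimal MPI-IO sketch of the "writing is sending" analogy; the file name and offsets are illustrative.

#include "mpi.h"

int main ( int argc, char *argv[] )
{
    int rank ;
    MPI_File fh ;
    MPI_Status st ;

    MPI_Init ( &argc, &argv ) ;
    MPI_Comm_rank ( MPI_COMM_WORLD, &rank ) ;
    /* every process "sends" its rank to its own offset in a shared file */
    MPI_File_open ( MPI_COMM_WORLD, "out.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh ) ;
    MPI_File_write_at ( fh, (MPI_Offset) ( rank * sizeof(int) ),
                        &rank, 1, MPI_INT, &st ) ;
    MPI_File_close ( &fh ) ;
    MPI_Finalize ( ) ;
    return 0 ;
}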