
Page 1:

Parallel Computing Basics

With a Case Study on the CeGP Cluster 

Dr. Tamir Hegazy, Dr. Entao Liu, Dr. Zhiling Long

September 17, 2014

Page 2:

Supercomputing & Geophysics

Page 3:

Seminar Goals

After this seminar, you should be able to:

– Differentiate among
  • Parallel computing architectures
  • Parallel programming models

– Write parallel programs
  • Matlab
  • Message Passing Interface (MPI)
  • MatlabMPI/pMatlab

– Run your parallel programs on the CeGP cluster

Page 4:

Outline

• Parallel computing basics

• New CeGP cluster

• Matlab Parallel Processing Toolbox

• Message Passing Interface (MPI)

• MatlabMPI/pMatlab

Page 5:

Parallel Computing Basics

Page 6:

Definition

• Classic definition

A parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast."

[Diagram: form vs. function — a uniprocessor (one processor with memory and I/O) contrasted with parallel processors (several processors sharing memory and I/O).]

Page 7:

Coupling in Parallel Systems

                      Tightly Coupled   Loosely Coupled
Share/sync clock?     Yes               No
Share bus?            Yes               No
Communication         Faster            Slower
Cost                  Higher            Lower
Scalability           Lower             Higher
Energy efficiency     Higher            Lower
Examples              Multi-cores       Today's clusters

[Diagram: a uniprocessor (P–M–I/O), a tightly coupled system (several processors sharing one memory and I/O), and a loosely coupled system (separate P–M–I/O nodes connected to each other).]

Page 8:

Parallel Computing Paradigms

                 Shared Address Space    Message Passing
Coupling         Tighter                 Looser
Communication    Through bus/memory      Through network
Primitives       read, write             send, recv

• What's the difference between parallel and distributed systems?

Page 9:

Shared Address Space

[Diagram: two shared-address-space organizations — "Dancehall" (UMA), where all processors reach all memories through one switch/network, and NUMA, where each processor has local memory but can also reach remote memory through the network.]

• Versus shared memory
• Common architectures

• User‐level operations: read/write or load/store

[Diagram: each process's virtual address space has private and shared portions; the shared portions of processes i and j map to the same physical memory.]

Page 10:

Message Passing

• Versus message passing for inter-process communication

• Architecture: similar to NUMA

– Major difference: communication through I/O, not memory

• Architecture convergence

[Diagram: NUMA-like organization — processors with local memory connected through a switch/network.]

Page 11:

Flynn's Taxonomy

                                Data Stream
                          Single        Multiple
Instruction    Single     SISD          SIMD
Stream         Multiple   MISD          MIMD

[Diagrams: instruction pool vs. data pool for each class — SISD (one P), SIMD (one instruction stream driving multiple Ps over the data pool), MISD (multiple instruction streams over one data stream), MIMD (multiple of both).]

Program:  Single → SPMD,  Multiple → MPMD
(SPMD is a.k.a. data parallel, vs. task parallel)

Page 12:

Parallelization Steps

[Diagram: sequential computation → Decomposition into tasks → Assignment to processes p0–p3 (Decomposition + Assignment = Partitioning) → Orchestration of the parallel program → Mapping onto the processors of the parallel architecture.]

Page 13:

Parallelizing a Once‐Sequential Program

[Diagram: the sequential execution (time Ts) consists of a parallelizable part and an inherently sequential part; after parallelization, the parallelizable part shrinks and the total time is Tp.]

Speedup = Ts/Tp

Page 14:

Speedup Analysis (1)

Tp = s·Ts + (1 − s)·Ts/p + Tcomm

S = Ts / Tp = 1 / ( s + (1 − s)/p + Tcomm/Ts )

where
  S              = speedup
  p              = # processors
  Ts             = sequential execution time
  Tp             = parallel execution time
  (1 − s)·Ts/p   = parallel processing time
  Tcomm          = effective communication time
  s              = inherently sequential fraction

Page 15:

Speedup Analysis (2)

• Amdahl's Law
  – Ignore communication time

      S(p) = 1 / ( s + (1 − s)/p )      (s: inherently sequential fraction, p: # processors)

• Amdahl's limit

      lim (p → ∞) S(p) = 1 / s

      s       0.5    0.2    0.1    0.01    0.001
      Limit   2      5      10     100     1000

Page 16:

Speedup Analysis (3)

• Degree of parallelism limit
  – Degree of parallelism: number of parallel operations in a program
  – The useful speedup cannot exceed it: S(p) ≤ d

• Communication limit
  – Assume Tcomm(p) = f(p)·Ts   (f: communication-to-computation ratio)
  – Assume s = 0 (perfectly parallelizable):

      S(p) = 1 / ( 1/p + f(p) )

      lim (p → ∞) S(p) = 1 / f = comp. / comm.

• Efficiency: E(p) = S(p)/p ≤ 1   (E(p) = 1 corresponds to linear speedup)

Page 17:

CeGP Cluster at GT

Page 18:

CeGP Cluster at GT

• Received in July 2014
• Provider: PSSC Labs
  – Based in California
  – Top provider of cluster solutions
• Three-year warranty and support
• Expansion plans

Page 19:

Cluster Specs at a Glance

• Physical cores: 44
• Logical cores: 88 (hyperthreading)
• Nodes: 3
• Total Memory: 192 GB
• Total Storage: 34 TB
• Interconnect: Gigabit Ethernet
• UPS for head node
• OS: CentOS

Page 20:

Cluster Architecture Overview

[Diagram: the three nodes, connected through a Gb Ethernet switch —
  Head node:       12 cores (24 HT) Intel Xeon @ 2.1 GHz, 64 GB RAM, Gb Ethernet NIC, 16x 2 TB storage
  Compute node 1:  16 cores (32 HT) Intel Xeon @ 2.6 GHz, 64 GB RAM, Gb Ethernet NIC, 1 TB storage
  Compute node 2:  16 cores (32 HT) Intel Xeon @ 2.6 GHz, 64 GB RAM, Gb Ethernet NIC, 1 TB storage]

Page 21:

MATLAB Parallel Toolbox

Page 22:

Parallel Processing/Computing in MATLAB

• Carry out multiple tasks simultaneously on different processors
• Speed up task-parallel applications
• MATLAB has a well-documented and convenient parallel processing toolbox
• Built-in syntax abstracts away the complexity involved in parallel computing
• Supports the use of NVIDIA GPUs

Page 23:

Basics in Parallel Processing 

• Parallel for loops (multi-core CPUs)
  – parfor (independent parameter sweeps / Monte Carlo experiments)
  – Distributed arrays (matrices larger than the memory limit of a single computer)

• GPU computing
  – CUDA-enabled NVIDIA GPUs
  – FFT/FFT2
  – A\b

Page 24:

Application Speedup

Figure borrowed from www.mathworks.com

Page 25:

How to use parfor?

• Create a pool of workers:
      pool = parpool(8);

• Do the parallel computing using parfor loops

• Clean up by deleting the pool of workers when finished:
      delete(gcp);
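Putting the three steps together, a minimal sketch (the worker count, loop body, and variable names here are illustrative, not from the slides):

% Open a pool, run independent iterations in a parfor loop, then close the pool.
pool = parpool(8);           % start 8 workers; choose a count that fits your machine
n = 1e6;
y = zeros(1, n);
parfor i = 1:n               % iterations are independent, so they may run in any order
    y(i) = sqrt(i);
end
total = sum(y);
delete(gcp);                 % shut the pool down when finished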

Page 26:

parfor

• Adding 1 to s can be done in any order
• The condition p(i) is independent of s
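The loop the bullets describe did not survive extraction; a minimal reconstruction, assuming the condition p(i) is some per-iteration test (isprime is used here purely as a stand-in):

% Count how many i in 1..n satisfy the condition; s is a parfor reduction variable.
n = 1e5;
s = 0;
parfor i = 1:n
    if isprime(i)      % stands in for the slide's condition p(i); it does not depend on s
        s = s + 1;     % the increments can happen in any order and the result is the same
    end
end
disp(s)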

Page 27:

Distributed Arrays/Matrices
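The slide's example is not preserved in this extraction; a minimal sketch of distributed arrays (matrix size and worker count are illustrative):

% Spread a large matrix across the workers' memory and solve A*x = b on it.
parpool(4);
A = distributed.rand(8000);       % 8000x8000 matrix partitioned across the workers
b = distributed.rand(8000, 1);
x = A \ b;                        % backslash operates directly on distributed arrays
res = norm(gather(A*x - b));      % gather() brings the (small) result back to the client
delete(gcp);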

Page 28:

GPU Arrays
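The slide's example is not preserved; a minimal gpuArray sketch (requires a CUDA-enabled NVIDIA GPU; the matrix size is illustrative):

% Move data to the GPU, compute there, and copy the result back.
A = gpuArray(rand(4096));     % copy a 4096x4096 matrix to GPU memory
F = fft2(A);                  % fft2 runs on the GPU for gpuArray inputs
Fhost = gather(F);            % copy the result back to host memory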

Page 29:

Benchmarking A\b on the GPU
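The benchmark figure is not preserved; a rough sketch of how such a CPU-versus-GPU timing could be done (problem size is illustrative):

% Time A\b on the CPU and on the GPU for the same problem.
n = 4000;
A = rand(n);   b = rand(n, 1);
tCPU = timeit(@() A \ b);
Ag = gpuArray(A);   bg = gpuArray(b);
tGPU = gputimeit(@() Ag \ bg);    % gputimeit synchronizes the GPU around the timed call
fprintf('CPU: %.3f s   GPU: %.3f s   speedup: %.1fx\n', tCPU, tGPU, tCPU/tGPU);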

Page 30:

Message Passing Interface (MPI)

Page 31:

MPI at a Glance

• Not a language

• Standardized system, several implementations

• Dominates parallel programming and HPC (High Performance Computing)

• Set of routines callable from C/C++, Fortran

• MPI 1.0: 1992

• MPI 2.0: 1997

• We focus on MPI 1.0 for C


Page 32:

MPI Program Structure

• #include headers
• Declare variables (general and MPI-related)
• Initialize MPI
• Work in parallel
• Finalize MPI

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int myID, nProc;                          /* general and MPI-related declarations */
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);    /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);     /* rank of this process */

    /* ... work in parallel ... */

    MPI_Finalize();                           /* finalize MPI */
    return 0;
}

Page 33:

Simple Example: Sum of 1 to 1,000,000

/* Specify boundaries (i, start, end, sum = 0, accum are ints declared as in the
   program structure on the previous slide) */
start = (1000000*myID/nProc) + 1;
end   = 1000000*(myID+1)/nProc;

/* Work in parallel */
for (i = start; i <= end; i++)
    sum = sum + i;

/* Collect results at process 0 */
if (myID == 0)
    for (i = 1; i < nProc; i++)
    {
        MPI_Recv(&accum, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
        sum = sum + accum;
    }
else
    MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

/* Display results */
if (myID == 0)
    printf("Sum from 1 to 1000000 is: %d\n", sum);

Page 34:

Compile and Run MPI

mpicc -o sum1e6.o sum1e6.c      # sum1e6.c: source file, sum1e6.o: output (binary) file
mpirun -np 10 sum1e6.o          # -np 10: number of processes

Sum from 1 to 1000000 is: 1784293664

(The printed value wraps around: the true sum, 500000500000, overflows a 32-bit int.)

Page 35:

Example: Matrix Multiplication

[Diagram: A × B = C.]

Page 36:

Matrix Multiplication: Data Partitioning

[Diagram: the rows of A are split into three blocks; P0, P1, and P2 each multiply their block of rows by B to produce the corresponding rows of C.]

Page 37:

MPI_Scatter

MPI_Scatter(A,               // send buffer
            4,               // send count
            MPI_FLOAT,       // send data type
            A_row,           // receive buffer
            4,               // receive count
            MPI_FLOAT,       // receive data type
            0,               // source (root) process id
            MPI_COMM_WORLD); // comm. handle

[Diagram: A on P0 is scattered so that P0, P1, and P2 each receive one A_row.]

Page 38:

MPI_Bcast

// (Unlike MPI_Scatter/MPI_Gather, MPI_Bcast takes a single buffer argument.)
MPI_Bcast(B,                 // buffer: holds the data on the root, receives it elsewhere
          20,                // count
          MPI_FLOAT,         // data type
          0,                 // source (root) process id
          MPI_COMM_WORLD);   // comm. handle

[Diagram: the matrix on P0 is broadcast so that P0, P1, and P2 each hold a copy of B.]

Page 39:

MPI_Gather

MPI_Gather(C_row,            // send buffer: this process's row of C
           5,                // send count
           MPI_FLOAT,        // send data type
           C,                // receive buffer: the full result (significant on the root only)
           5,                // receive count (per process)
           MPI_FLOAT,        // receive data type
           0,                // destination (root) process id
           MPI_COMM_WORLD);  // comm. handle

[Diagram: the C_row pieces on P0, P1, and P2 are gathered into C on P0.]

Page 40:

More MPI Collective Calls (1)

MPI_Allgather  (= MPI_Gather + MPI_Bcast)

Diagrams from mpitutorial.com

Page 41:

More MPI Collective Calls (2)

Reduction operations: MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, MPI_LOR, MPI_BAND, MPI_BOR, MPI_MAXLOC, MPI_MINLOC

MPI_Allreduce  (= MPI_Reduce + MPI_Bcast)

Diagrams from mpitutorial.com

Page 42:

MPI Collective Calls (3)

MPI_Alltoall()

[Diagram: each of P0, P1, P2 sends a distinct block of its buffer to every process and receives one block from each.]

Page 43:

Blocking vs. Non-blocking Calls

• Order of MPI_Send, MPI_Recv could cause deadlocks

• There are equivalent nonblocking calls: MPI_Isend, MPI_Irecv

Page 44:

MatlabMPI

Page 45:

What is MatlabMPI

• A Matlab implementation of a subset of the Message Passing Interface (MPI) standard, developed at MIT Lincoln Lab.

• An extremely compact (~200 lines) implementation on top of standard Matlab file I/O.

• Can match the bandwidth of C-based MPI at large message sizes.

Page 46:

Installation/Setup

• Download at http://www.ll.mit.edu/mission/cybersec/softwaretools/matlabmpi/matlabmpi.html

• PC: Add to startup.m (usually at $matlabroot/toolbox/local)

– “addpath MatlabMPIInstallationFolder\src”

– “addpath ProgramLaunchingFolder\MatMPI”

• Linux: Add to startup.m

– “addpath /home/username/MatlabMPI/src”

• Source file MPI_Probe.m needs to be modified:

– Line 42:  [pathstr, name, ext, versn] = fileparts(file_name); — remove the fourth output, versn (recent MATLAB versions of fileparts return only three outputs)

Page 47:

Principles

Principles of MatlabMPI: Implementation of Basic MPI Communications
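The diagram itself is not preserved. In essence, MatlabMPI implements send/receive with ordinary file I/O on a directory visible to all machines: the sender saves the message to a buffer file and then creates a lock file; the receiver polls for the lock file and then loads the buffer. A conceptual sketch (the file names are illustrative, not MatlabMPI's actual naming scheme):

% Sender (say, rank 1 sending variable 'data' to rank 0 with tag 7):
data = rand(4);
save('p1_to_p0_t7_buffer.mat', 'data');       % write the message payload
fclose(fopen('p1_to_p0_t7_lock.mat', 'w'));   % create the lock file: "message is ready"

% Receiver (rank 0):
while ~exist('p1_to_p0_t7_lock.mat', 'file')
    pause(0.1);                               % poll until the lock file appears
end
msg = load('p1_to_p0_t7_buffer.mat');
data = msg.data;                              % the received variable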

Page 48:

Functions: Core

• The core library implements the basic MPI operations:

  MPI_Run.m        MPI_Init.m       MPI_Finalize.m
  MPI_Comm_size.m  MPI_Comm_rank.m
  MPI_Send.m       MPI_Recv.m
  MPI_Abort.m      MPI_Bcast.m      MPI_Probe.m      MPI_cc.m

Page 49:

Functions: Utility

• The utility library implements auxiliary functions besides the MPI core:

  MatMPI_Comm_dir.m       MatMPI_Save_messages.m
  MatMPI_Delete_all.m     MatMPI_Comm_settings.m
  MatMPI_Buffer_file.m    MatMPI_Lock_file.m
  MatMPI_Commands.m       MatMPI_Comm_init.m
  MatMPI_mcc_wrappers

Page 50:

Example: Image Convolution

Serial implementation:   C = conv2(A, B, 'same');
  [Flow: Image A → padded image → convolve with kernel B (flipped kernel) → convolved image C]
  Padding is needed around the borders; it is taken care of by Matlab.

Parallel implementation:
  [Flow: Image A → split into sub_images → pad each into a work_image → each convolves with kernel B, in a serial manner → merge into the convolved image]
  Padding must be done using actual pixel data from the neighboring sub_images to avoid error.

Page 51:

Code: Skeleton

% General initialization.
MPI_Init;                               % Initialize MPI.
comm = MPI_COMM_WORLD;                  % Create communicator.
comm_size = MPI_Comm_size(comm);        % Get size.
my_rank = MPI_Comm_rank(comm);          % Get rank.

% Prepare kernel(nFX, nFY).
% Prepare image(nX, nY).
% Split into sub_image(nSX, nSY) for each processor.
% Prepare work_image based on sub_image with appropriate padding
% (as shown on the next slide).

% Convolve each work_image with the kernel.
work_image = conv2(work_image, kernel, 'same');
% Extract convolved results.
sub_image = work_image(nFX/2+1:nFX/2+nSX, 1:nSY);

% Finalize MatlabMPI.
MPI_Finalize;

% Host processor collects sub_images and merges into the final image.
% End of program

Page 52:

Code: Padding

% Find out ranks for left and right processors.
left = my_rank - 1;
if (left < 0), left = comm_size - 1; end
right = my_rank + 1;
if (right >= comm_size), right = 0; end
ltag = 1; rtag = 2;                                   % Create message tags.

% Extract from left/right side of sub_image and send to left/right processor.
l_sub_image = sub_image(1:nFX/2, 1:nSY);
MPI_Send(left, ltag, comm, l_sub_image);
r_sub_image = sub_image(nSX-nFX/2+1:nSX, 1:nSY);
MPI_Send(right, rtag, comm, r_sub_image);

% Prepare work_image(nSX+nFX, nSY).
work_image = zeros(nSX+nFX, nSY);
work_image(nFX/2+1:nFX/2+nSX, 1:nSY) = sub_image;     % Copy sub_image into central part.
r_pad = MPI_Recv(right, ltag, comm);                  % Receive right padding from right processor.
work_image(nFX/2+nSX+1:nSX+nFX, 1:nSY) = r_pad;       % Put into right part of work_image.
l_pad = MPI_Recv(left, rtag, comm);                   % Receive left padding from left processor.
work_image(1:nFX/2, 1:nSY) = l_pad;                   % Put into left part of work_image.

[Diagram: each processor's work_image consists of l_pad (from the left neighbor's sub_image), its own sub_image, and r_pad (from the right neighbor's sub_image), exchanged via MPI_Send/MPI_Recv.]

Page 53:

How to Run

An example script (RUN.m from the package):

% Abort left over jobs.
MPI_Abort;
pause(2.0);

% Delete left over MPI directory.
MatMPI_Delete_all;
pause(2.0);

% Define machines; empty means run locally.
machines = {};
% Define machines.
%machines = {'machineA:/directoryA' ...
%            'machineB:/directoryB'};

% Run scripts.
eval(MPI_Run('convolveImage', 2, machines));

• System requirements:
  – Shared-memory systems require a single Matlab license; distributed-memory systems require one Matlab license per machine.
  – A directory visible to every machine (defaults to the launching directory but can be changed).

Page 54:

Comparison with MPI

• File I/O is less efficient.

• More desirable if many other Matlab toolboxes may be utilized for the application.

Page 55:

pMatlab

• Newer package built upon MatlabMPI.

• Hides message passing from the programmer.

• Utilizes a global array library, implemented using MatlabMPI.

• Interested? Check it out at http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html

Page 56:

Questions?

Thank You!
