
Page 1:

Parallel Computing Basics

With a Case Study on the CeGP Cluster 

Dr. Tamir Hegazy, Dr. Entao Liu, Dr. Zhiling Long

September 17, 2014

Page 2:

Supercomputing & Geophysics

Page 3:

Seminar Goals

After this seminar, you should be able to:

– Differentiate among
  • Parallel computing architectures
  • Parallel programming models

– Write parallel programs
  • Matlab
  • Message Passing Interface (MPI)
  • MatlabMPI/pMatlab

– Run your parallel programs on the CeGP cluster

Page 4:

Outline

• Parallel computing basics

• New CeGP cluster

• Matlab Parallel Processing Toolbox

• Message Passing Interface (MPI)

• MatlabMPI/pMatlab

Page 5:

Parallel Computing Basics

Page 6:

Definition

• Classic definition

A parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast."

[Diagram: form vs. function — a uniprocessor (one processor with memory and I/O) contrasted with parallel processors (several processors sharing memory and I/O).]

Page 7:

Coupling in Parallel Systems

                      Tightly Coupled   Loosely Coupled
Share/sync clock?     Yes               No
Share bus?            Yes               No
Communication         Faster            Slower
Cost                  Higher            Lower
Scalability           Lower             Higher
Energy efficiency     Higher            Lower
Examples              Multi-cores       Today's clusters

[Diagram: a uniprocessor (P–M–I/O), a tightly coupled system (several processors sharing one memory and I/O), and a loosely coupled system (separate P–M–I/O nodes connected to each other).]

Page 8:

Parallel Computing Paradigms

                 Shared Address Space    Message Passing
Coupling         Tighter                 Looser
Communication    Through bus/memory      Through network
Primitives       read, write             send, recv

• What's the difference between parallel and distributed systems?

Page 9:

Shared Address Space

[Diagram: two shared-address-space organizations — "Dancehall" (UMA), where all processors reach all memories through one switch/network, and NUMA, where each processor has local memory but can also reach remote memory through the network.]

• Versus shared memory
• Common architectures

• User‐level operations: read/write or load/store

[Diagram: each process's virtual address space has private and shared portions; the shared portions of processes i and j map to the same physical memory.]

Page 10:

Message Passing

• Versus message passing for inter-process communication

• Architecture: similar to NUMA

– Major difference: communication through I/O, not memory

• Architecture convergence

[Diagram: NUMA-like organization — processors with local memory connected through a switch/network.]

Page 11:

Flynn's Taxonomy

                                Data Stream
                          Single        Multiple
Instruction    Single     SISD          SIMD
Stream         Multiple   MISD          MIMD

[Diagrams: instruction pool vs. data pool for each class — SISD (one P), SIMD (one instruction stream driving multiple Ps over the data pool), MISD (multiple instruction streams over one data stream), MIMD (multiple of both).]

Program:  Single → SPMD,  Multiple → MPMD
(SPMD is a.k.a. data parallel, vs. task parallel)

Page 12:

Parallelization Steps

[Diagram: sequential computation → Decomposition into tasks → Assignment to processes p0–p3 (Decomposition + Assignment = Partitioning) → Orchestration of the parallel program → Mapping onto the processors of the parallel architecture.]

Page 13:

Parallelizing a Once‐Sequential Program

[Diagram: the sequential execution (time Ts) consists of a parallelizable part and an inherently sequential part; after parallelization, the parallelizable part shrinks and the total time is Tp.]

Speedup = Ts/Tp

Page 14:

Speedup Analysis (1)

Tp = s·Ts + (1 − s)·Ts/p + Tcomm

S = Ts / Tp = 1 / ( s + (1 − s)/p + Tcomm/Ts )

where
  S              = speedup
  p              = # processors
  Ts             = sequential execution time
  Tp             = parallel execution time
  (1 − s)·Ts/p   = parallel processing time
  Tcomm          = effective communication time
  s              = inherently sequential fraction

Page 15:

Speedup Analysis (2)

• Amdahl's Law
  – Ignore communication time

      S(p) = 1 / ( s + (1 − s)/p )      (s: inherently sequential fraction, p: # processors)

• Amdahl's limit

      lim (p → ∞) S(p) = 1 / s

      s       0.5    0.2    0.1    0.01    0.001
      Limit   2      5      10     100     1000

Page 16:

Speedup Analysis (3)

• Degree of parallelism limit
  – Degree of parallelism: number of parallel operations in a program
  – The useful speedup cannot exceed it: S(p) ≤ d

• Communication limit
  – Assume Tcomm(p) = f(p)·Ts   (f: communication-to-computation ratio)
  – Assume s = 0 (perfectly parallelizable):

      S(p) = 1 / ( 1/p + f(p) )

      lim (p → ∞) S(p) = 1 / f = comp. / comm.

• Efficiency: E(p) = S(p)/p ≤ 1   (E(p) = 1 corresponds to linear speedup)

Page 17:

CeGP Cluster at GT

Page 18:

CeGP Cluster at GT

• Received in July 2014
• Provider: PSSC Labs
  – Based in California
  – Top provider of cluster solutions
• Three-year warranty and support
• Expansion plans

Page 19:

Cluster Specs at a Glance

• Physical cores: 44
• Logical cores: 88 (hyperthreading)
• Nodes: 3
• Total Memory: 192 GB
• Total Storage: 34 TB
• Interconnect: Gigabit Ethernet
• UPS for head node
• OS: CentOS

Page 20:

Cluster Architecture Overview

[Diagram: the three nodes, connected through a Gb Ethernet switch —
  Head node:       12 cores (24 HT) Intel Xeon @ 2.1 GHz, 64 GB RAM, Gb Ethernet NIC, 16x 2 TB storage
  Compute node 1:  16 cores (32 HT) Intel Xeon @ 2.6 GHz, 64 GB RAM, Gb Ethernet NIC, 1 TB storage
  Compute node 2:  16 cores (32 HT) Intel Xeon @ 2.6 GHz, 64 GB RAM, Gb Ethernet NIC, 1 TB storage]

Page 21:

MATLAB Parallel Toolbox

Page 22:

Parallel Processing/Computing in MATLAB

• Carry out multiple tasks simultaneously on different processors
• Speed up task-parallel applications
• MATLAB has a well-documented and convenient parallel processing toolbox
• Built-in syntax abstracts away the complexity involved in parallel computing
• Supports the use of NVIDIA GPUs

Page 23:

Basics in Parallel Processing 

• Parallel for loops (multi-core CPUs)
  – parfor (independent parameter sweeps / Monte Carlo experiments)
  – Distributed arrays (matrices larger than the memory limit of a single computer)

• GPU computing
  – CUDA-enabled NVIDIA GPUs
  – FFT/FFT2
  – A\b

Page 24:

Application Speedup

Figure borrowed from www.mathworks.com

Page 25:

How to use parfor?

• Create a pool of workers:
      pool = parpool(8);

• Do the parallel computing using parfor loops

• Clean up by deleting the pool of workers when finished:
      delete(gcp);
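Putting the three steps together, a minimal sketch (the worker count, loop body, and variable names here are illustrative, not from the slides):

% Open a pool, run independent iterations in a parfor loop, then close the pool.
pool = parpool(8);           % start 8 workers; choose a count that fits your machine
n = 1e6;
y = zeros(1, n);
parfor i = 1:n               % iterations are independent, so they may run in any order
    y(i) = sqrt(i);
end
total = sum(y);
delete(gcp);                 % shut the pool down when finished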

Page 26:

parfor

• Adding 1 to s can be done in any order
• The condition p(i) is independent of s
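The loop the bullets describe did not survive extraction; a minimal reconstruction, assuming the condition p(i) is some per-iteration test (isprime is used here purely as a stand-in):

% Count how many i in 1..n satisfy the condition; s is a parfor reduction variable.
n = 1e5;
s = 0;
parfor i = 1:n
    if isprime(i)      % stands in for the slide's condition p(i); it does not depend on s
        s = s + 1;     % the increments can happen in any order and the result is the same
    end
end
disp(s)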

Page 27:

Distributed Arrays/Matrices
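The slide's example is not preserved in this extraction; a minimal sketch of distributed arrays (matrix size and worker count are illustrative):

% Spread a large matrix across the workers' memory and solve A*x = b on it.
parpool(4);
A = distributed.rand(8000);       % 8000x8000 matrix partitioned across the workers
b = distributed.rand(8000, 1);
x = A \ b;                        % backslash operates directly on distributed arrays
res = norm(gather(A*x - b));      % gather() brings the (small) result back to the client
delete(gcp);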

Page 28:

GPU Arrays
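The slide's example is not preserved; a minimal gpuArray sketch (requires a CUDA-enabled NVIDIA GPU; the matrix size is illustrative):

% Move data to the GPU, compute there, and copy the result back.
A = gpuArray(rand(4096));     % copy a 4096x4096 matrix to GPU memory
F = fft2(A);                  % fft2 runs on the GPU for gpuArray inputs
Fhost = gather(F);            % copy the result back to host memory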

Page 29:

Benchmarking A\b on the GPU
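The benchmark figure is not preserved; a rough sketch of how such a CPU-versus-GPU timing could be done (problem size is illustrative):

% Time A\b on the CPU and on the GPU for the same problem.
n = 4000;
A = rand(n);   b = rand(n, 1);
tCPU = timeit(@() A \ b);
Ag = gpuArray(A);   bg = gpuArray(b);
tGPU = gputimeit(@() Ag \ bg);    % gputimeit synchronizes the GPU around the timed call
fprintf('CPU: %.3f s   GPU: %.3f s   speedup: %.1fx\n', tCPU, tGPU, tCPU/tGPU);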

Page 30:

Message Passing Interface (MPI)

Page 31:

MPI at a Glance

• Not a language

• Standardized system, several implementations

• Dominates parallel programming and HPC (High Performance Computing)

• Set of routines callable from C/C++, Fortran

• MPI 1.0: 1992

• MPI 2.0: 1997

• We focus on MPI 1.0 for C


Page 32:

MPI Program Structure

• #include headers
• Declare variables (general and MPI-related)
• Initialize MPI
• Work in parallel
• Finalize MPI

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int myID, nProc;                          /* general and MPI-related declarations */
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);    /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);     /* rank of this process */

    /* ... work in parallel ... */

    MPI_Finalize();                           /* finalize MPI */
    return 0;
}

Page 33:

Simple Example: Sum of 1 to 1,000,000

/* Specify boundaries (i, start, end, sum = 0, accum are ints declared as in the
   program structure on the previous slide) */
start = (1000000*myID/nProc) + 1;
end   = 1000000*(myID+1)/nProc;

/* Work in parallel */
for (i = start; i <= end; i++)
    sum = sum + i;

/* Collect results at process 0 */
if (myID == 0)
    for (i = 1; i < nProc; i++)
    {
        MPI_Recv(&accum, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
        sum = sum + accum;
    }
else
    MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

/* Display results */
if (myID == 0)
    printf("Sum from 1 to 1000000 is: %d\n", sum);

Page 34:

Compile and Run MPI

mpicc -o sum1e6.o sum1e6.c      # sum1e6.c: source file, sum1e6.o: output (binary) file
mpirun -np 10 sum1e6.o          # -np 10: number of processes

Sum from 1 to 1000000 is: 1784293664

(The printed value wraps around: the true sum, 500000500000, overflows a 32-bit int.)

Page 35:

Example: Matrix Multiplication

[Diagram: A × B = C.]

Page 36:

Matrix Multiplication: Data Partitioning

[Diagram: the rows of A are split into three blocks; P0, P1, and P2 each multiply their block of rows by B to produce the corresponding rows of C.]

Page 37:

MPI_Scatter

MPI_Scatter(A,               // send buffer
            4,               // send count
            MPI_FLOAT,       // send data type
            A_row,           // receive buffer
            4,               // receive count
            MPI_FLOAT,       // receive data type
            0,               // source (root) process id
            MPI_COMM_WORLD); // comm. handle

[Diagram: A on P0 is scattered so that P0, P1, and P2 each receive one A_row.]

Page 38:

MPI_Bcast

// (Unlike MPI_Scatter/MPI_Gather, MPI_Bcast takes a single buffer argument.)
MPI_Bcast(B,                 // buffer: holds the data on the root, receives it elsewhere
          20,                // count
          MPI_FLOAT,         // data type
          0,                 // source (root) process id
          MPI_COMM_WORLD);   // comm. handle

[Diagram: the matrix on P0 is broadcast so that P0, P1, and P2 each hold a copy of B.]

Page 39:

MPI_Gather

MPI_Gather(C_row,            // send buffer: this process's row of C
           5,                // send count
           MPI_FLOAT,        // send data type
           C,                // receive buffer: the full result (significant on the root only)
           5,                // receive count (per process)
           MPI_FLOAT,        // receive data type
           0,                // destination (root) process id
           MPI_COMM_WORLD);  // comm. handle

[Diagram: the C_row pieces on P0, P1, and P2 are gathered into C on P0.]

Page 40:

More MPI Collective Calls (1)

MPI_Allgather  (= MPI_Gather + MPI_Bcast)

Diagrams from mpitutorial.com

Page 41:

More MPI Collective Calls (2)

Reduction operations: MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, MPI_LOR, MPI_BAND, MPI_BOR, MPI_MAXLOC, MPI_MINLOC

MPI_Allreduce  (= MPI_Reduce + MPI_Bcast)

Diagrams from mpitutorial.com

Page 42:

MPI Collective Calls (3)

MPI_Alltoall()

[Diagram: each of P0, P1, P2 sends a distinct block of its buffer to every process and receives one block from each.]

Page 43:

Blocking vs. Non-blocking Calls

• Order of MPI_Send, MPI_Recv could cause deadlocks

• There are equivalent nonblocking calls: MPI_Isend, MPI_Irecv

Page 44:

MatlabMPI

Page 45:

What is MatlabMPI

• A Matlab implementation of a subset of the Message Passing Interface (MPI) standard, developed at MIT Lincoln Lab.

• An extremely compact (~200 lines) implementation on top of standard Matlab file I/O.

• Can match the bandwidth of C-based MPI at large message sizes.

Page 46:

Installation/Setup

• Download at http://www.ll.mit.edu/mission/cybersec/softwaretools/matlabmpi/matlabmpi.html

• PC: Add to startup.m (usually at $matlabroot/toolbox/local)

– “addpath MatlabMPIInstallationFolder\src”

– “addpath ProgramLaunchingFolder\MatMPI”

• Linux: Add to startup.m

– “addpath /home/username/MatlabMPI/src”

• Source file MPI_Probe.m needs to be modified:

– Line 42:  [pathstr, name, ext, versn] = fileparts(file_name); — remove the fourth output, versn (recent MATLAB versions of fileparts return only three outputs)

Page 47:

Principles

Principles of MatlabMPI: Implementation of Basic MPI Communications
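The diagram itself is not preserved. In essence, MatlabMPI implements send/receive with ordinary file I/O on a directory visible to all machines: the sender saves the message to a buffer file and then creates a lock file; the receiver polls for the lock file and then loads the buffer. A conceptual sketch (the file names are illustrative, not MatlabMPI's actual naming scheme):

% Sender (say, rank 1 sending variable 'data' to rank 0 with tag 7):
data = rand(4);
save('p1_to_p0_t7_buffer.mat', 'data');       % write the message payload
fclose(fopen('p1_to_p0_t7_lock.mat', 'w'));   % create the lock file: "message is ready"

% Receiver (rank 0):
while ~exist('p1_to_p0_t7_lock.mat', 'file')
    pause(0.1);                               % poll until the lock file appears
end
msg = load('p1_to_p0_t7_buffer.mat');
data = msg.data;                              % the received variable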

Page 48:

Functions: Core

• The core library implements the basic MPI operations:

  MPI_Run.m        MPI_Init.m       MPI_Finalize.m
  MPI_Comm_size.m  MPI_Comm_rank.m
  MPI_Send.m       MPI_Recv.m
  MPI_Abort.m      MPI_Bcast.m      MPI_Probe.m      MPI_cc.m

Page 49:

Functions: Utility

• The utility library implements auxiliary functions besides the MPI core:

  MatMPI_Comm_dir.m       MatMPI_Save_messages.m
  MatMPI_Delete_all.m     MatMPI_Comm_settings.m
  MatMPI_Buffer_file.m    MatMPI_Lock_file.m
  MatMPI_Commands.m       MatMPI_Comm_init.m
  MatMPI_mcc_wrappers

Page 50:

Example: Image Convolution

Serial implementation:   C = conv2(A, B, 'same');
  [Flow: Image A → padded image → convolve with kernel B (flipped kernel) → convolved image C]
  Padding is needed around the borders; it is taken care of by Matlab.

Parallel implementation:
  [Flow: Image A → split into sub_images → pad each into a work_image → each convolves with kernel B, in a serial manner → merge into the convolved image]
  Padding must be done using actual pixel data from the neighboring sub_images to avoid error.

Page 51:

Code: Skeleton

% General initialization.
MPI_Init;                               % Initialize MPI.
comm = MPI_COMM_WORLD;                  % Create communicator.
comm_size = MPI_Comm_size(comm);        % Get size.
my_rank = MPI_Comm_rank(comm);          % Get rank.

% Prepare kernel(nFX, nFY).
% Prepare image(nX, nY).
% Split into sub_image(nSX, nSY) for each processor.
% Prepare work_image based on sub_image with appropriate padding
% (as shown on the next slide).

% Convolve each work_image with the kernel.
work_image = conv2(work_image, kernel, 'same');
% Extract convolved results.
sub_image = work_image(nFX/2+1:nFX/2+nSX, 1:nSY);

% Finalize MatlabMPI.
MPI_Finalize;

% Host processor collects sub_images and merges into the final image.
% End of program

Page 52:

Code: Padding

% Find out ranks for left and right processors.
left = my_rank - 1;
if (left < 0), left = comm_size - 1; end
right = my_rank + 1;
if (right >= comm_size), right = 0; end
ltag = 1; rtag = 2;                                   % Create message tags.

% Extract from left/right side of sub_image and send to left/right processor.
l_sub_image = sub_image(1:nFX/2, 1:nSY);
MPI_Send(left, ltag, comm, l_sub_image);
r_sub_image = sub_image(nSX-nFX/2+1:nSX, 1:nSY);
MPI_Send(right, rtag, comm, r_sub_image);

% Prepare work_image(nSX+nFX, nSY).
work_image = zeros(nSX+nFX, nSY);
work_image(nFX/2+1:nFX/2+nSX, 1:nSY) = sub_image;     % Copy sub_image into central part.
r_pad = MPI_Recv(right, ltag, comm);                  % Receive right padding from right processor.
work_image(nFX/2+nSX+1:nSX+nFX, 1:nSY) = r_pad;       % Put into right part of work_image.
l_pad = MPI_Recv(left, rtag, comm);                   % Receive left padding from left processor.
work_image(1:nFX/2, 1:nSY) = l_pad;                   % Put into left part of work_image.

[Diagram: each processor's work_image consists of l_pad (from the left neighbor's sub_image), its own sub_image, and r_pad (from the right neighbor's sub_image), exchanged via MPI_Send/MPI_Recv.]

Page 53:

How to Run

An example script (RUN.m from the package):

% Abort left over jobs.
MPI_Abort;
pause(2.0);

% Delete left over MPI directory.
MatMPI_Delete_all;
pause(2.0);

% Define machines; empty means run locally.
machines = {};
% Define machines.
%machines = {'machineA:/directoryA' ...
%            'machineB:/directoryB'};

% Run scripts.
eval(MPI_Run('convolveImage', 2, machines));

• System requirements:
  – Shared-memory systems require a single Matlab license; distributed-memory systems require one Matlab license per machine.
  – A directory visible to every machine (defaults to the launching directory but can be changed).

Page 54:

Comparison with MPI

• File I/O is less efficient.

• More desirable if many other Matlab toolboxes may be utilized for the application.

Page 55:

pMatlab

• Newer package built upon MatlabMPI.

• Hides message passing from the programmer.

• Utilizes a global array library, implemented using MatlabMPI.

• Interested? Check it out at http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html

Page 56:

Questions?

Thank You!
