
Seminar series delivered at PSU on high performance computing


Page 1: HPC Essentials

HPC Essentials
Part I : UNIX/C Overview

Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU

[email protected]

Page 2: HPC Essentials

Outline

● Introduction
    ● Hardware
    ● Definitions
    ● UNIX
        ● Kernel & shell
● Files
    ● Permissions
    ● Utilities
    ● Bash Scripting
● C programming

[email protected]

Page 3: HPC Essentials

HPC Introduction

● HPC systems are composed of:
    ● Software
    ● Hardware
        ● Devices (eg., disks)
        ● Compute elements (eg., CPU)
        ● Shared and/or distributed memory
        ● Communication (eg., Infiniband network)
● An HPC system isn't truly high performance unless the hardware is configured correctly and the software leverages all of the resources made available to it in an optimal manner
● An operating system controls the execution of software on the hardware; HPC clusters almost exclusively use UNIX/Linux
● In the computational sciences, we pass data and/or abstractions through a pipelined workflow; UNIX is the natural analogue to this solving/discovery process

[email protected]

Page 4: HPC Essentials

UNIX

● UNIX is a multi-user, multi-tasking OS created by Dennis Ritchie and Ken Thompson at AT&T Bell Labs in 1969-1970, written primarily in the C language (also developed by Ritchie)

● UNIX is composed of:
    ● Kernel
        ● The OS itself, which handles scheduling, memory management, I/O etc
    ● Shell (eg., Bash)
        ● Interacts with the kernel; the command line interpreter
    ● Utilities
        ● Programs run by the shell; tools for file manipulation and interaction with the system
    ● Files
        ● Everything but process(es), composed of data...

[email protected]

Page 5: HPC Essentials

Data-Related Definitions

● Binary
    ● Most fundamental data representation in computing, the base 2 number system (others: hex → base 16, oct → base 8)
● Byte
    ● 8 bits = 8b = 1 Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc
● ASCII
    ● American Standard Code for Information Interchange; a character encoding scheme using 7 bits per character (traditionally); UTF-8, a Unicode encoding, extends it with 8-bit code units
● Stream
    ● A flow of bytes; a process reads from stdin and writes to stdout (and stderr)
● Bus
    ● Communication channel over which data flows; connects elements within a machine
● Process
    ● Fundamental unit of computational work performed by a processor; the CPU executes application or OS instructions
● Node
    ● A single computer, composed of many elements, with various architectures for the CPU, eg., x86, RISC

[email protected]

Page 6: HPC Essentials

Typical Compute Node (Intel i7)

[Block diagram: the CPU connects to RAM (volatile storage) over the memory bus, to the IOH via the QuickPath Interconnect (which exposes PCI-express for the GPU and other PCI-e cards), and through the IOH to the ICH via the Direct Media Interface (SATA/USB non-volatile storage, BIOS, ethernet/NETWORK)]

[email protected]

Page 7: HPC Essentials

More Definitions

● Cluster
    ● Many nodes connected together via a network
● Network
    ● Inter-node communication channel; connects machines
● Shared Memory
    ● Memory region shared within a node
● Distributed Memory
    ● Memory region spread across two or more nodes
● Direct Memory Access (DMA)
    ● Access memory independently of programmed I/O, ie., independent of the CPU
● Bandwidth
    ● Rate of data transfer across a serial or parallel communication channel, expressed as bits (b) or Bytes (B) per second (s)
    ● Beware quoted bandwidth figures; many factors apply, eg., simplex/duplex, peak/sustained, no. of lanes etc
    ● Latency, the time to create a communication channel, is often more important

[email protected]

Page 8: HPC Essentials

Bandwidths

● Devices
    ● USB : 60 MB/s (version 2.0)
    ● Hard Disk : 100 MB/s - 500 MB/s
    ● PCIe : 32 GB/s (x8, version 2.0)
● Networks
    ● 10/100BaseT : 10/100 Mbit/s
    ● 1000BaseT (1GigE) : 1000 Mbit/s
    ● 10 GigE : 10 Gbit/s
    ● Infiniband QDR 4X : 40 Gbit/s
● Memory
    ● CPU : ~35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)*
    ● GPU : ~180 GB/s (GeForce GTX 480)
● AVOID devices, keep data resident in memory, and minimize communication between processes
● There are MANY subtleties to CPU memory management, eg., with 8x CPU cores total bandwidth may be > 300 GB/s or as little as 10 GB/s; will discuss further

*http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations?t=anon#fbid=XZRzflqVZ6J

[email protected]

Page 9: HPC Essentials

Outline

● Introduction
    ● HPC hardware
    ● Definitions
    ● UNIX
        ● Kernel & shell
● Files
    ● Permissions
    ● Utilities
    ● Bash Scripting
● C programming

[email protected]

Page 10: HPC Essentials

UNIX Permissions & Files

● At the highest level, UNIX objects are either files or processes, and both are protected by permissions (processes next time)
● Every file object has two IDs, the user and the group, both assigned on creation; only the root user has unrestricted access to everything
● Files also have bits which specify read (r), write (w) and execute (x) permissions for the user, group and others, eg., output of the ls command:

-rw-r--r-- 1 root root 0 Jun 11 1976 /usr/local/foo.txt
(user/group/others permission bits, user ID, group ID, filename)

● We can manipulate files using myriad utilities; these utilities are commands interpreted by the shell and executed by the kernel
● To learn more, check the man pages, ie., from the command line 'man <command>'

[email protected]

Page 11: HPC Essentials

File Manipulation I

● Working from the command line in a Bash shell:

● List directory foo_dir contents, human readable:
[wjb19@lionga scratch] $ ls -lah foo_dir

● Change ownership of foo.xyz to wjb19; group and user:
[wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz

● Add execute permission to foo.xyz:
[wjb19@lionga scratch] $ chmod +x foo.xyz

● Determine filetype for foo.xyz:
[wjb19@lionga scratch] $ file foo.xyz

● Peruse text file foo.xyz:
[wjb19@lionga scratch] $ more foo.xyz

[email protected]

Page 12: HPC Essentials

File Manipulation II

● Copy foo.txt from lionga to file /home/bill/foo.txt on dirac:
[wjb19@lionga scratch] $ scp foo.txt \
    [email protected]:/home/bill/foo.txt

● Create a gzip compressed archive of directory foo and contents:
[wjb19@lionga scratch] $ tar -czf foo_archive.tgz foo/*

● Create a bzip2 compressed archive of directory foo and contents:
[wjb19@lionga scratch] $ tar -cjf foo_archive.tbz foo/*

● Unpack a compressed file archive:
[wjb19@lionga scratch] $ tar -xvf foo_archive.tgz

● Edit a text file using VIM:
[wjb19@lionga scratch] $ vim foo.txt

● VIM is a venerable and powerful command line editor with a rich set of commands

[email protected]

Page 13: HPC Essentials

Text File Edit w/ VIM

● Two main modes of operation: editing and command. From command mode, switch to edit mode by issuing 'a' (insert after cursor) or 'i' (before); switch back to command mode via <ESC>
● Paste: 'p'  undo: 'u'  redo: '<CNTRL>-r'
● Move up/down one screen line: '-' and '+'
● Search for expression exp forward with '/exp<ENTER>' or backward with '?exp<ENTER>' ('n' or 'N' navigate down/up the highlighted matches)

Save w/o quitting :w<ENTER>

Save and quit (ie., <shift> AND 'z' AND 'z') :wq<ENTER>

Quit w/o saving :q!<ENTER>

Delete x lines eg,. x=10 (also stored in clipboard) d10d

Yank (copy) x lines eg., x=10 y10y

Split screen/buffer :split<ENTER>

Switch window/buffer <CNTRL>-w-w

Go to line x eg., x=10 :10<ENTER>

Find matching construct (eg., from { to }) %

[email protected]

Page 14: HPC Essentials

Text File Compare w/ VIMDIFF

● Same commands as VIM, but highlights differences between files and allows transfer of text between buffers/files; launch with 'vimdiff foo.txt foo2.txt'
● Push text from right to left (when the right window is active and the cursor is in the relevant region) using command 'dp'
● Pull text from right to left (when the left window is active and the cursor is in the relevant region) using command 'do'

[email protected]

Page 15: HPC Essentials

Bash Scripting

● File and other utilities can be assembled into scripts, interpreted by the shell, eg., Bash
● Scripts can be collections of commands/utilities & fundamental programming constructs:

Code comment                              # this is a comment
Pipe stdout of procA to stdin of procB    procA | procB
Redirect stdout of procA to file foo.txt* procA > foo.txt
Command separator                         procA; procB
If block                                  if [ condition ]; then procA; fi
Display on stdout                         echo "hello"
Variable assignment & literal value       a="foo"; echo $a
Concatenate strings                       b="${a}foo2"
Text processing utilities                 sed, gawk
Search utilities                          find, grep

*Streams have file descriptors (numbers) associated with them; eg., to redirect stderr from procA to foo.txt → procA 2> foo.txt

[email protected]

Page 16: HPC Essentials

Text Processing

● Text documents are composed of records (roughly speaking, lines separated by carriage returns) and fields (separated by spaces)

● Text processing using sed & gawk involves coupling patterns with actions, eg., print field 1 in document foo.txt when encountering the word image:

[wjb19@lionga scratch] $ gawk '/image/ {print $1;}' foo.txt

● Parse without case sensitivity, change from the default space field separator (FS) to the equals sign, and print field 2:

[wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="="} \
    /image/ {print $2;}' foo.txt

● Putting it all together → create a Bash script w/ VIM or another editor (eg., Pico)...

[email protected]

pattern action input

Page 17: HPC Essentials

Bash Example I

#!/bin/bash

#set source and destination paths
DIR_PATH=~/scratch/espresso-PRACE/PW
BAK_PATH=~/scratch/PW_BAK

declare -a file_list

#filenames to array
file_list=$(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}')

cnt=0;

#parse files & pretty up
for x in $file_list
do
    let "cnt+=1"
    sed 's/\,\&/\,\ \&/g' $BAK_PATH/$x | \
    sed 's/)/)\ /g' | \
    sed 's/call/\ call\ /g' | \
    sed 's/CALL/\ call\ /g' > $DIR_PATH/$x

    echo cleaned file no. $cnt $x
done

exit

Run using bash

Declare an array

Search & replace

Command output

[email protected]

Page 18: HPC Essentials

Bash Example II

#!/bin/bash

if [ $# -lt 6 ]
then
    echo usage: fitCPCPMG.sh '[/path/and/filename.csv] \
    [desired number of gaussians in mixture (2-10)] \
    [no. random samples (1000-10000)] \
    [mcmc steps (1000-30000)] \
    [percent noise level (0-10)] \
    [percent step size (0.01-20)] \
    [/path/to/restart/filename.csv; optional]'
    exit
fi

ext=${1##*.}

if [ "$ext" != "csv" ]
then
    echo ERROR: file must be *.csv
    exit
fi

base=$(basename $1 .csv)

if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]]
then
    echo "ERROR: must specify 2<=x<=10 gaussians in mixture"
    exit
fi

Total arguments

File basename

File extension

[email protected]

Page 19: HPC Essentials

Outline

●Introduction● HPC hardware● Definitions● UNIX

● Kernel & shell●Files

● Permissions● Utilities● Bash Scripting

●C programming

[email protected]

Page 20: HPC Essentials

The C Language

● Utilities, user applications and indeed the UNIX OS itself are executed by the CPU when expressed as machine code, eg., store/load from memory, addition etc
● Fundamental operations like memory allocation, I/O etc are laborious to express at this level, so most frequently we begin from a high-level language like C
● The process of creating an executable consists of at least 3 fundamental steps: creation of a source code text file containing all desired objects and operations, compilation, and linking, eg., using the GNU tool gcc to create executable foo.x from source file foo.c:

[wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x

[email protected]

[Diagram: source *.c file → (compile) → object *.o code → (link, together with library objects) → executable; *C99 standard]

Page 21: HPC Essentials

C Code Elements I

●Composed of primitive datatypes (eg., int, float, long), which have different sizes in memory, multiples of 1 byte

●May be composed of statically allocated memory (compile time), dynamically allocated memory (runtime), or both

●Pointers (eg., float *) are primitives with 4 or 8 byte lengths (32bit or 64bit machines) which contain an address to a contiguous region of dynamically allocated memory

●More complicated objects can be constructed from primitives and arrays eg., a struct
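● A concrete sketch of these sizes and of building a struct from primitives; the struct name and members below are illustrative only, not taken from any of the seminar codes:

#include <stdio.h>

//a hypothetical struct built from primitives and a fixed-size array
struct sample {
    int    id;          //typically 4 bytes
    float  values[4];   //16 bytes
    double weight;      //8 bytes
    float *data;        //pointer: 4 or 8 bytes (32-bit or 64-bit machine)
};

int main(){
    //sizeof reports the size in bytes of each type at compile time
    printf("int      : %zu bytes\n", sizeof(int));
    printf("float    : %zu bytes\n", sizeof(float));
    printf("long     : %zu bytes\n", sizeof(long));
    printf("float *  : %zu bytes\n", sizeof(float *));
    //the struct may be larger than the sum of its members due to padding
    printf("struct sample : %zu bytes\n", sizeof(struct sample));
    return 0;
}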

[email protected]

Page 22: HPC Essentials

C Code Elements II

●Common operations are gathered into functions, the most common being main(), which must be present in executable

●Functions have a distinct name, take arguments, and return output; this information comprises the prototype, expressed separately to the implementation details, former often in header file

●Important system functions include read,write,printf (I/O) and malloc,free (Memory)

●The operating system executes compiled code; a running program is a process (more next time)

[email protected]

Page 23: HPC Essentials

C Code Example

#include <stdio.h>
#include <stdlib.h>
#include "allDefines.h"

//Kirchoff Migration function in psktmCPU.c
void ktmMigrationCPU(struct imageGrid* imageX,
        struct imageGrid* imageY,
        struct imageGrid* imageZ,
        struct jobParams* config,
        float* midX,
        float* midY,
        float* offX,
        float* offY,
        float* traces,
        float* slowness,
        float* image);

int main(){

        int IMAGE_SIZE = 10;
        float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
        printf("size of image = %i\n",IMAGE_SIZE);

        for (int i=0; i<IMAGE_SIZE; i++)
                printf("image point %i = %f\n",i,image[i]);

        free(image);
        return 0;
}

[email protected]

Tells preprocessor to include these headers; system functions etc

Function prototype; must give arguments, their types and return type; implementation elsewhere

Page 24: HPC Essentials

UNIX C Good Practice I

● Use the three standard streams, with file descriptors 0, 1 and 2 respectively; this allows assembly of operations into a pipeline, and these data streams are 'cheap' to use

● Only hand simple command line options to main() using argc, argv[]; in general we wish to handle short and long options (eg., see the GNU coding standards) and the use of getopt_long() is preferable (see the sketch at the end of this slide)

● Utilize the environment variables of the host shell, particularly when setting runtime conditions in executed code via getenv(); eg., in Bash, set in the .bashrc config file or via the command line:
[wjb19@lionga scratch] $ export MY_STRING=hello

●If your project/program requires a) sophisticated objects b) many developers c) would benefit from object oriented design principles, you should consider writing in C++ (although being a higher-level language it is harder to optimize)
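● A minimal sketch combining the getopt_long() and getenv() suggestions above; the option names and the MY_STRING variable are illustrative assumptions, not part of any seminar code:

#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>

int main(int argc, char *argv[]){

    //long options: --input <file> and --verbose (short forms -i and -v)
    static struct option long_opts[] = {
        {"input",   required_argument, 0, 'i'},
        {"verbose", no_argument,       0, 'v'},
        {0, 0, 0, 0}
    };

    int opt, verbose = 0;
    const char *input = NULL;

    while ((opt = getopt_long(argc, argv, "i:v", long_opts, NULL)) != -1){
        switch (opt){
            case 'i': input   = optarg; break;
            case 'v': verbose = 1;      break;
            default:
                fprintf(stderr, "usage: %s --input <file> [--verbose]\n", argv[0]);
                return 1;
        }
    }

    //pick up a runtime setting from the host shell environment,
    //eg., after: export MY_STRING=hello
    const char *s = getenv("MY_STRING");
    if (verbose)
        printf("input=%s MY_STRING=%s\n", input ? input : "(none)", s ? s : "(unset)");

    return 0;
}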

[email protected]

Page 25: HPC Essentials

UNIX C Good Practice II

● In high performance applications, avoid system calls, eg., read/write, where control is given over to the kernel and processes can be blocked until the resource (eg., disk) is ready
    ● IF system calls must be used, handle errors and report to stderr
    ● IF temporary files must be written, use mkstemp, which sets safe permissions, followed by unlink; the file descriptor is closed by the kernel when the program exits and the file is then removed

● Use assert to test the validity of function arguments, statements etc; it introduces a performance hit, but asserts can be removed at compile time with the NDEBUG macro (C standard); a combined sketch of these last two points follows at the end of this slide

●Debug with gdb, profile with gprof, valgrind; target most expensive functions for optimization

●Put common functions in/use libraries wherever possible....
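● A minimal combined sketch of the mkstemp/unlink and assert points above; the template path and message contents are arbitrary choices:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]){

    //assert documents and checks an assumption; compiling with -DNDEBUG removes it
    assert(argc >= 1);

    //mkstemp replaces the XXXXXX, creates the file with safe permissions (0600),
    //and returns an open file descriptor
    char tmpl[] = "/tmp/myapp-XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1){
        perror("mkstemp");
        return EXIT_FAILURE;
    }

    //unlink immediately; the file persists until the descriptor is closed,
    //and the kernel removes it when the program exits
    unlink(tmpl);

    const char *msg = "scratch data\n";
    if (write(fd, msg, strlen(msg)) == -1)
        perror("write");

    close(fd);
    return 0;
}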

[email protected]

Page 26: HPC Essentials

Key HPC Libraries

● BLAS/LAPACK/ScaLAPACK
    ● Original basic and extended linear algebra routines
    ● http://www.netlib.org/
● Intel Math Kernel Library (MKL)
    ● Implementation of the above routines, w/ solvers, FFT etc
    ● http://software.intel.com/en-us/articles/intel-mkl/
● AMD Core Math Library (ACML)
    ● Ditto
    ● http://developer.amd.com/libraries/acml/pages/default.aspx
● OpenMPI
    ● Open source MPI implementation
    ● http://www.open-mpi.org/
● PETSc
    ● Data structures and routines for parallel scientific applications based on PDEs
    ● http://www.mcs.anl.gov/petsc/petsc-as/

[email protected]

Page 27: HPC Essentials

UNIX C Compilation I

● In general the creation and use of shared libraries (*.so) is preferable to static (*.a), for space reasons and ease of software updates

● Program in modules and link separate objects

● Use the -fPIC flag when compiling shared libraries; PIC (position independent code) means the code in the shared object does not depend on the address/location at which it is loaded

●Use the make utility to manage builds (more next time)

●Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/created for the application, respectively

[email protected]

Page 28: HPC Essentials

UNIX C Compilation II

● Remember in compilation steps to -I/set/header/paths and keep the interface (in headers) separate from the implementation as much as possible

● Remember in linking steps for shared libs to:
    ● -L/set/path/to/library AND
    ● set the flag -lmyLib, where
    ● /set/path/to/library/libmyLib.so must exist

otherwise you will have undefined references and/or 'can't find -lmyLib' etc

●Compile with ­Wall or similar and fix all warnings

●Read the manual :)

[email protected]

Page 29: HPC Essentials

Conclusions

● High Performance Computing systems are an assembly of hardware and software working together, usually based on the UNIX OS; multiple compute nodes are connected together

●The UNIX kernel is surrounded by a shell eg., Bash; commands and constructs may be assembled into scripts

●UNIX, associated utilities and user applications are traditionally written in high-level languages like C

●HPC user applications may take advantage of shared or distributed memory compute models, or both

●Regardless, good code minimizes I/O, keeps data resident in memory for as long as possible and minimizes communication between processes

●User applications should take advantage of existing high performance libraries, and tools like gdb, gprof and valgrind

[email protected]

Page 30: HPC Essentials

References

● Dennis Ritchie, RIP
    ● http://en.wikipedia.org/wiki/Dennis_Ritchie
● Advanced Bash scripting guide
    ● http://tldp.org/LDP/abs/html/
● Text processing w/ GAWK
    ● http://www.gnu.org/s/gawk/manual/gawk.html
● Advanced Linux programming
    ● http://www.advancedlinuxprogramming.com/alp-folder/
● Excellent optimization tips
    ● http://www.lri.fr/~bastoul/local_copies/lee.html
● GNU compiler collection documents
    ● http://gcc.gnu.org/onlinedocs/
● Original RISC design paper
    ● http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf
● C++ FAQ
    ● http://www.parashift.com/c++-faq-lite/
● VIM Wiki
    ● http://vim.wikia.com/wiki/Vim_Tips_Wiki

[email protected]

Page 31: HPC Essentials

Exercises

● Take the supplied code and compile using gcc, creating executable foo.x; attempt to run as './foo.x'
● The code has a segmentation fault, an error in memory allocation which is handled via the malloc function
● Recompile with the debug flag -g, run through gdb and correct the source of the segmentation fault
● Load the valgrind module, ie., 'module load valgrind', and then run as 'valgrind ./foo.x'; this powerful profiling tool will help identify memory leaks, or memory on the heap* which has not been freed

● Write a Bash script that stores your home directory file contents in an array and:
    ● Uses sed to swap vowels (eg., 'a' and 'e') in names
    ● Parses the array of names and returns only a single match, if it exists, else echoes NO-MATCH

*heap == region of dynamically allocated memory
[email protected]

Page 32: HPC Essentials

GDB quick start

● Launch:
[wjb19@tesla1 scratch]$ gdb ./foo.x

● Run w/ command line argument '100':
(gdb) run 100

● Set a breakpoint at line 10 in the source file:
(gdb) b foo.c:10
Breakpoint 1 at 0x400594: file foo.c, line 10.
(gdb) run
Starting program: /gpfs/scratch/wjb19/foo.x

Breakpoint 1, main () at foo.c:22
22          int IMAGE_SIZE = 10;

● Step to the next instruction (issuing 'continue' will resume execution):
(gdb) step
23          float * image = (float*) malloc (IMAGE_SIZE*sizeof(float));

● Print the value at index 2 in array 'image':
(gdb) p image[2]
$4 = 0

● Display a full backtrace:
(gdb) bt full
#0  main () at foo.c:27
        i = 0
        IMAGE_SIZE = 10
        image = 0x601010

[email protected]

Page 33: HPC Essentials

HPC Essentials
Part II : Elements of Parallelism

Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU

[email protected]

Page 34: HPC Essentials

Outline

● Introduction
    ● Motivation
    ● HPC operations
    ● Multiprocessors
    ● Processes
    ● Memory Digression
        ● Virtual Memory
        ● Cache
● Threads
    ● POSIX
    ● OpenMP
    ● Affinity

[email protected]

Page 35: HPC Essentials

Motivation

● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)

●As a natural consequence, we seek both performance and scaling in our scientific applications

● Therefore we want to increase both the floating point operations performed and the memory bandwidth, and thus seek parallelization as we run out of resources on a single processor

●We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:

1/((1-P) + P/N)

where P is the portion of application code we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
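● As a quick numerical check of this formula, a small sketch that tabulates the speedup for a hypothetical value of P:

#include <stdio.h>

//Amdahl's law: speedup = 1/((1-P) + P/N)
double speedup(double P, int N){
    return 1.0 / ((1.0 - P) + P / N);
}

int main(){
    double P = 0.90;   //assume 90% of the runtime is parallelizable
    for (int N = 1; N <= 256; N *= 2)
        printf("N = %3d  speedup = %5.2f\n", N, speedup(P, N));
    //as N grows the speedup saturates near 1/(1-P) = 10
    return 0;
}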

[email protected]

Page 36: HPC Essentials

Motivation

● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors

●Nonetheless, for many applications we have a good chance of parallelizing the vast majority of the code...

[email protected]

[Figure: improvement factor vs number of processors (0-256), plotted for P = 90%, 60%, 30% and 10%]

Page 37: HPC Essentials

Example : Kirchhoff Time Migration

● KTM is a technique used widely in oil+gas exploration, providing images of the earth's interior, used to identify resources
● Seismic trace data acquired over a 2D geometry is integrated to give an image of the earth's interior, using ~ Green's method

●Input is generally 10^4 – 10^6 traces, 10^3 – 10^4 data points each, ie., lots of data to process; output image is also very large

●This is an integral technique (ie., summation, easy to parallelize), just one of many popular algorithms performed in HPC

[email protected]

[Equation residue: each image point accumulates a weight × trace data sum; x == image space == seismic space, t == traveltime]

Page 38: HPC Essentials

Common Operations in HPC

● Integration
    ● Load/store, add & multiply
    ● eg., transforms
● Derivatives (finite differences)
    ● Load/store, subtract & divide
    ● eg., PDEs
● Linear Algebra
    ● Load/store, subtract/add/multiply/divide
    ● chemistry & physics, solvers
    ● sparse (classical physics) & dense (quantum)

●Regardless of the operations performed, after compilation into machine code, when executed by the CPU, instructions are clocked through a pipeline into registers for execution

●Instruction execution generally takes place in four steps, and multiple instruction groups are concurrent within the pipeline; execution rate is a direct function of the clock rate

[email protected]

Page 39: HPC Essentials

Execution Pipeline

● This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware, or the prediction of which instruction in a program is the next to execute*

●At a similar level, present in more recent devices are so-called streaming SIMD extension (SSE) registers and associated compute hardware

[email protected]

[Pipeline diagram: instruction stages 1. Fetch, 2. Decode, 3. Execute, 4. Write-back, staggered across clock cycles 0-7, with instructions shown as pending, executing or completed]

*assisted by compiler hints

Page 40: HPC Essentials

SSE

● Streaming SIMD (Single Instruction, Multiple Data) computation exploits special registers and instructions to increase computation many-fold in certain cases, since several data elements are operated on simultaneously

● Each of the 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long, storing 4 x 32-bit floating-point numbers; the SSE2 and SSE3 specifications expanded the allowed datatypes to include doubles, ints etc

● Operations may be 'scalar' or 'packed' (ie., vector), expressed using instructions in an __asm block within C code, eg.,

addps   xmm0,xmm1

● One can either code the intrinsics explicitly (see the sketch at the end of this slide), or rely on the compiler, eg., icc with optimization (-O3)

● The next level up of parallelization is the multiprocessor
[email protected]

[Diagram: a 128-bit xmm register holding float3..float0 (bit 127 down to bit 0); instruction format: operation, dst operand, src operand]
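● A minimal sketch of packed single-precision addition using the SSE compiler intrinsics from <xmmintrin.h>, rather than an explicit __asm block; the array contents are arbitrary and the code assumes a compiler with SSE enabled (eg., gcc -msse or icc):

#include <stdio.h>
#include <xmmintrin.h>   //SSE intrinsics

int main(){
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     //load 4 floats into an xmm register
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  //packed add: 4 additions at once (cf addps)
    _mm_storeu_ps(c, vc);            //store the 4 results back to memory

    for (int i = 0; i < 4; i++)
        printf("c[%d] = %f\n", i, c[i]);
    return 0;
}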

Page 41: HPC Essentials

Multiprocessor Overview

● Multiprocessors or multiple core CPUs are becoming ubiquitous; better scaling (cf Moore's law) but limited by contention for shared resources, especially memory

● Most commonly we deal with Symmetric Multiprocessors (SMP), each core with unique cache and registers, as well as shared memory region(s); more on cache in a moment

[email protected]

[Diagram: two CPUs (CPU0 and CPU1), each with its own registers and cache, both connected to shared main memory]

●Memory not necessarily next to processors → Non-uniform Memory Access (NUMA); try to ensure memory access is as local to CPU core(s) as possible

●The proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo)

●The fundamental unit of work on the cores is a process...

Page 42: HPC Essentials

Processes

●Application processes are launched on the CPU by the kernel using the fork() system call; every process has a process ID pid, available on UNIX systems via the getpid() system call

●The kernel manages many processes concurrently; all information required to run a process is contained in the process control block (PCB) data structure, containing (among other things):

    ● The pid
    ● The address space
    ● I/O information, eg., open files/streams
    ● Pointer to the next PCB

●Processes may spawn children using the fork() system call; children are initially a copy of the parent, but may take on different attributes via the exec() call
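● A minimal sketch of fork()/getpid(); here the child simply exec()s a standard utility (ls), replacing its image as described above:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(){
    printf("parent pid = %d\n", getpid());

    pid_t pid = fork();      //child starts as a copy of the parent
    if (pid == -1){
        perror("fork");
        return EXIT_FAILURE;
    }

    if (pid == 0){
        //child: report ids, then replace the process image
        printf("child pid = %d, ppid = %d\n", getpid(), getppid());
        execlp("ls", "ls", "-l", (char *) NULL);
        perror("execlp");    //only reached if exec fails
        _exit(EXIT_FAILURE);
    }

    //parent waits for the child to finish
    waitpid(pid, NULL, 0);
    return 0;
}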

[email protected]

Page 43: HPC Essentials

Processes

● A child process records the id of its parent (ppid), and additionally has a unique pid, eg., output from the ps command, describing itself:

[wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C"
 PPID   PID COMMAND             ELAPSED %CPU
12608  1719   sshd              01:07:54  0.0
 1719  1724     sshd            01:07:49  0.0
 1724  1725       bash          01:07:48  0.0
 1725  1986         ps             00:00  0.0

●During a context switch, kernel will swap one process control block for another; context switches are detrimental to HPC and have one or more triggers, including:

    ● I/O requests
    ● Timer interrupts

●Context switching is a very fine-grained form of scheduling; on compute clusters we also have coarse grained scheduling in the form of job scheduling software (more next time)

●The unique address space from the perspective of the process is referred to as virtual memory

[email protected]

Page 44: HPC Essentials

Virtual Memory

● A running process is given memory by the kernel, referred to as virtual memory (VM); its address space does not correspond to the physical memory address space

●The Memory Management Unit (MMU) on CPU translates between the two address spaces, for requests made between process and OS

●Virtual Memory for every process has the same structure, below left; virtual address space is divided into units called pages

[email protected]

[Diagram: virtual memory layout from high address to low address: environment variables & function arguments, stack, unused region, heap, instructions]

●The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache

● Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es)

Page 45: HPC Essentials

Cache : Introduction

●In HPC, we talk about problems being compute or memory bound

● In the former case, we are limited by the rate at which instructions can be executed by the CPU

● In the latter, we are limited by the rate at which data can be processed by the CPU

●Both instructions and data are loaded into cache; cache memory is laid out in lines

●Cache memory is intermediate in the overall hierarchy, lying between CPU registers and main memory

● If the executing process requests an address corresponding to data or instructions in cache, we have a 'hit', else 'miss', and a much slower retrieval of instruction or data from main memory must take place

[email protected]

Page 46: HPC Essentials

Cache : Introduction

● Modern architectures have various levels of cache and divisions of responsibilities; we will follow the valgrind-cachegrind convention, from the manual:

[email protected]

... It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2)

Page 47: HPC Essentials

Cache Example

● The distribution of data to cache levels is largely set by the compiler, hardware and kernel; however, the programmer is still responsible for the best possible data access patterns in his/her code
● Use cachegrind to optimize data alignment & cache usage, eg.,

#include <stdlib.h>
#include <stdio.h>

int main(){

        int SIZE_X,SIZE_Y;
        SIZE_X=2048;
        SIZE_Y=2048;

        float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));

        for (int i=0; i<SIZE_X; i++)
                for (int j=0; j<SIZE_Y; j++)
                        data[j+SIZE_Y*i] = 10.0f * 3.14f;
                        //bad data access
                        //data[i+SIZE_Y*j] = 10.0f * 3.14f;

        free(data);

        return 0;
}

[email protected]

Page 48: HPC Essentials

Cache : Bad Access

bill@bill­HP­EliteBook­6930p:~$ valgrind ­­tool=cachegrind ./foo.x==3088== Cachegrind, a cache and branch­prediction profiler==3088== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al.==3088== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info==3088== Command: ./foo.x==3088== ==3088== ==3088== I   refs:      50,503,275==3088== I1  misses:           734==3088== LLi misses:           733==3088== I1  miss rate:       0.00%==3088== LLi miss rate:       0.00%==3088== ==3088== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)==3088== D1  misses:     4,197,161  (     2,335 rd   + 4,194,826 wr)==3088== LLd misses:     4,196,772  (     1,985 rd   + 4,194,787 wr)==3088== D1  miss rate:       12.4% (       0.0%     +      99.6%  )==3088== LLd miss rate:       12.4% (       0.0%     +      99.6%  )==3088== ==3088== LL refs:        4,197,895  (     3,069 rd   + 4,194,826 wr)==3088== LL misses:      4,197,505  (     2,718 rd   + 4,194,787 wr)==3088== LL miss rate:         4.9% (       0.0%     +      99.6%  )

[email protected]

instructions

data

lowest level

READ Ops WRITE Ops

Page 49: HPC Essentials

Cache : Good Access

bill@bill­HP­EliteBook­6930p:~$ valgrind ­­tool=cachegrind ./foo.x==4410== Cachegrind, a cache and branch­prediction profiler==4410== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al.==4410== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info==4410== Command: ./foo.x==4410== ==4410== ==4410== I   refs:      50,503,275==4410== I1  misses:           734==4410== LLi misses:           733==4410== I1  miss rate:       0.00%==4410== LLi miss rate:       0.00%==4410== ==4410== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)==4410== D1  misses:       265,002  (     2,335 rd   +   262,667 wr)==4410== LLd misses:       264,613  (     1,985 rd   +   262,628 wr)==4410== D1  miss rate:        0.7% (       0.0%     +       6.2%  )==4410== LLd miss rate:        0.7% (       0.0%     +       6.2%  )==4410== ==4410== LL refs:          265,736  (     3,069 rd   +   262,667 wr)==4410== LL misses:        265,346  (     2,718 rd   +   262,628 wr)==4410== LL miss rate:         0.3% (       0.0%     +       6.2%  )

[email protected]

Page 50: HPC Essentials

Cache Performance

●For large data problems, any speedup introduced by parallelization can easily be negated by poor cache utilization

●In this case, memory bandwidth is an order of magnitude worse for problem size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to work hard to approach peak)

●In many cases we are limited also by random access patterns

[email protected]

[Figure: execution time (s) vs log2 SIZE_X (10-14), comparing low and high cache miss rates]

Page 51: HPC Essentials

Outline

● Introduction
    ● Motivation
    ● Computational operations
    ● Multiprocessors
    ● Processes
    ● Memory Digression
        ● Virtual Memory
        ● Cache
● Threads
    ● POSIX
    ● OpenMP
    ● Affinity

[email protected]

Page 52: HPC Essentials

POSIX Threads I

●A process may spawn one or more threads; on a multiprocessor, the OS can schedule these threads across a variety of cores, providing parallelism in the form of 'light-weight processes' (LWP)

●Whereas a child process receives a copy of the parent's virtual memory and executes independently thereafter, a thread shares the memory of the parent including instructions, and also has private data

●Using threads we perform shared memory processing (cf distributed memory, next time)

●We are at liberty to launch as many threads as we wish, although as you might expect, performance takes a hit as more threads are launched than can be scheduled simultaneously across available cores

[email protected]

Page 53: HPC Essentials

POSIX Threads II

●Pthreads refers to the POSIX standard, which is just a specification; implementations exist for various systems

● Each pthread has:
    ● An ID
    ● Attributes:
        ● Stack size
        ● Schedule information

●Much like processes, we can monitor thread execution using utilities such as top and ps

●The memory shared among threads must be used carefully in order to prevent race conditions, or threads seeing incorrect data during execution, due to more than one thread performing operations on said data, in an uncoordinated fashion

[email protected]

Page 54: HPC Essentials

POSIX Threads III

● Race conditions may be ameliorated through careful coding, but also through explicit constructs, eg., locks, whereby a single thread gains and then relinquishes control → implies serialization and computational overhead (see the sketch at the end of this slide)

● Multi-threaded programs must also avoid deadlock, a highly undesirable state where one or more threads await resources, and in turn are unable to offer up resources required by others

●Deadlocks can also be avoided through good coding, as well as the use of communication techniques based around semaphores, for example

●Threads awaiting resources may sleep (context switch by kernel, slow, saves cycles) or busy wait (executes while loop or similar checking semaphore, fast, wastes cycles)
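● A minimal sketch of the lock idea above: a pthread mutex serializes updates to a shared counter so several threads can increment it without a race (the thread count and loop bound are arbitrary; compile with eg., gcc -std=c99 -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

int sum = 0;                                          //shared data
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;     //protects sum

void *worker(void *arg){
    for (int i = 0; i < 1000; i++){
        pthread_mutex_lock(&lock);    //only one thread at a time past this point
        sum += 1;
        pthread_mutex_unlock(&lock);  //relinquish the lock
    }
    return NULL;
}

int main(){
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("sum = %d (expected %d)\n", sum, NTHREADS * 1000);
    return 0;
}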

[email protected]

Page 55: HPC Essentials

Pthreads Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);

int main(int argc, char *argv[]){

        pthread_t tid;
        pthread_attr_t attr;

        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);
        pthread_join(tid,NULL);
        printf("sum = %d\n",sum);
}

void * worker(void *total){

        int upper=atoi(total);
        sum = 0;

        for (int i=0; i<upper; i++)
                sum += i;

        pthread_exit(0);

}

[email protected]

global (shared) variable

thread id & attributes

worker thread creation & joinafter completion

local (private) variable

main thread

Page 56: HPC Essentials

Valgrind-helgrind output[wjb19@hammer16 scratch]$ valgrind ­­tool=helgrind ­v ./foo.x 100 ==5185== Helgrind, a thread error detector==5185== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al.==5185== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info==5185== Command: ./foo.x 100==5185== ­­5185­­ Valgrind options:­­5185­­    ­­tool=helgrind­­5185­­    ­v­­5185­­ Contents of /proc/version:­­5185­­   Linux version 2.6.18­274.7.1.el5 (mockbuild@x86­004.build.bos.redhat.com) (gcc version 

­­5185­­ REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy)­­5185­­ REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index)­­5185­­ REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5 (pthread_create@*)­­5185­­ REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc)­­5185­­ REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock)­­5185­­ REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc)­­5185­­ REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock)­­5185­­ REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen)­­5185­­ REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join)sum = 4950==5185== ==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)­­5185­­ ­­5185­­ used_suppression:      1 helgrind­glibc2X­101­­5185­­ used_suppression:      1 helgrind­glibc2X­112­­5185­­ used_suppression:      1 helgrind­glibc2X­102==5185== ==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)

[email protected]

system calls establishing thread ie., there is a COST to create and destroy threads

Page 57: HPC Essentials

Pthreads: Race Condition

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);

int main(int argc, char *argv[]){

        pthread_t tid;
        pthread_attr_t attr;

        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);
        int upper=atoi(argv[1]);
        sum=0;
        for (int i=0; i<upper; i++)
                sum+=i;

        pthread_join(tid,NULL);
        printf("sum = %d\n",sum);
}

[email protected]

main thread works on global variable as well, without synchronization/coordination

Page 58: HPC Essentials

Helgrind output w/ race[wjb19@hammer16 scratch]$ valgrind ­­tool=helgrind ./foo.x 100 ==5384== Helgrind, a thread error detector==5384== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al.==5384== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info==5384== Command: ./foo.x 100==5384== ==5384== Thread #1 is the program's root thread==5384== ==5384== Thread #2 was created==5384==    at 0x3A97ED447E: clone (in /lib64/libc­2.5.so)==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread­2.5.so)==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229)==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256)==5384==    by 0x400748: main (fooThread2.c:18)==5384== ==5384== Possible data race during write of size 4 at 0x600cdc by thread #1==5384==    at 0x400764: main (fooThread2.c:20)==5384==  This conflicts with a previous write of size 4 by thread #2==5384==    at 0x4007E3: worker (fooThread2.c:31)==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201)==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread­2.5.so)==5384==    by 0x3A97ED44BC: clone (in /lib64/libc­2.5.so)==5384==

●Pthreads is a versatile albeit large and inherently complicated interface

●We are primarily concerned with 'simply' dividing a workload among available cores; OpenMP proves much less unwieldy to use

[email protected]

built foo.x with debug on (-g) to find source file line(s) w/ error(s)

Page 59: HPC Essentials

OpenMP Introduction

●OpenMP is a set of multi-platform/OS compiler directives, libraries and environment variables for readily creating multi-threaded applications

●The OpenMP standard is managed by a review board, and is defined by a large number of hardware vendors

●Applications written using OpenMP employ pragmas, or statements interpreted by the preprocessor (before compilation), representing functionality like fork & join that would take considerably more effort and care to implement otherwise

●OpenMP pragmas or directives indicate parallel sections of code ie., after compilation, at runtime, threads are each given a portion of work eg., in this case, loop iterations will be divided evenly among running threads :

#pragma omp parallel for
for (int i=0; i<SIZE; i++)
        y[i]=x[i]*10.0f;

[email protected]

Page 60: HPC Essentials

OpenMP Clauses I

●The number of threads launched during parallel blocks may be set via function calls or by setting the OMP_NUM_THREADS environment variable

●Data objects are generally by default shared (loop counters are private by default), a number of pragma clauses are available, which are valid for the scope of the parallel section eg., :

    ● private
    ● shared
    ● firstprivate - initialized to the value before the parallel block
    ● lastprivate - variable keeps its value after the parallel block
    ● reduction - thread safe way of combining data at the conclusion of the parallel block (see the sketch below)

● Thread synchronization is implicit to parallel sections; there are a variety of clauses available for controlling this behavior also, including:

    ● critical - one thread at a time works in this section, eg., in order to avoid a race (expensive, design your code to avoid at all costs)
    ● atomic - safe memory updates performed using eg., mutual exclusion (cost)
    ● barrier - threads wait at this point for others to arrive

[email protected]
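● A minimal sketch of the reduction clause: each thread accumulates into a private partial sum, and the copies are combined safely when the parallel block ends (the array size is arbitrary; compile with eg., gcc -std=c99 -fopenmp or icc -openmp):

#include <omp.h>
#include <stdio.h>

#define SIZE 1000000

int main(){
    static float x[SIZE];
    double total = 0.0;

    for (int i = 0; i < SIZE; i++)
        x[i] = 1.0f;

    //each thread gets a private copy of total; the copies are summed
    //when the parallel region ends
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < SIZE; i++)
        total += x[i];

    printf("max threads = %d, total = %f\n", omp_get_max_threads(), total);
    return 0;
}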

Page 61: HPC Essentials

OpenMP Clauses II

●OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types :

    ● static - loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate that number of contiguous iterations to a thread
    ● dynamic - the total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size again given by chunk (see the sketch below)
    ● guided - a large section of contiguous iterations is allocated to each thread dynamically; the section size decreases exponentially with each successive allocation, to a minimum size specified by chunk
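● A small sketch of the schedule clause on a loop with uneven work per iteration; dynamic scheduling with a modest chunk keeps threads busy when iteration costs vary (the loop bounds and chunk size here are arbitrary):

#include <omp.h>
#include <stdio.h>

#define N 1024

int main(){
    double work[N];

    //iterations near the end do far more work than those at the start,
    //so an even static split would leave some threads idle
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++){
        double s = 0.0;
        for (int j = 0; j < i * 100; j++)
            s += (double) j;
        work[i] = s;
    }

    printf("work[0] = %f, work[%d] = %f\n", work[0], N - 1, work[N - 1]);
    return 0;
}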

[email protected]

Page 62: HPC Essentials

OpenMP Example : KTM●In our first attempt at parallelization shortly, we simply add an OpenMP pragma before the computational loops in worker function:#pragma omp parallel for//loop over trace recordsfor (int k=0; k<config­>traceNo; k++){

//loop over imageXfor(int i=0; i<Li; i++){

tempC = ( midX[k] ­ imageXX[i]­offX[k]) * (midX[k]­ imageXX[i]­offX[k]);          tempD = ( midX[k] ­ imageXX[i]+offX[k]) * (midX[k]­ imageXX[i]+offX[k]);

          //loop over imageY          for(int j=0; j<Lj; j++){           tempA = tempC + ( midY[k] ­ imageYY[j]­offY[k]) * (midY[k]­ imageYY[j]­offY[k]);               tempB = tempD + ( midY[k] ­ imageYY[j]+offY[k]) * (midY[k]­ imageYY[j]+offY[k]);

//loop over imageZ                                             for (int l=0; l<Ll; l++){                temp = sqrtf(tauS[l] + tempA * slownessS[l]);                    temp += sqrtf(tauS[l] + tempB * slownessS[l]);                    timeIndex = (int) (temp / sRate);

                    if ((timeIndex < config­>tracePts) && (timeIndex > 0)){                    image[i*Lj*Ll + j*Ll + l] +=

traces[timeIndex + k * config­>tracePts] * temp *sqrtf(tauS[l] / temp);                   }               } //imageZ          } //imageY     } //imageX}//input trace records

[email protected]

Page 63: HPC Essentials

OpenMP KTM Results

● Scales well up to eight cores, then drops off; the SMP model has deficiencies due to a number of factors, including:
    ● Coverage (Amdahl's law); as we increase processors, the relative cost of the serial code portion increases
    ● Hardware limitations
    ● Locality...

[email protected]

[Figure: execution time vs CPU cores (1, 2, 4, 8, 16)]

Page 64: HPC Essentials

CPU Affinity (Intel*)

●Recall that the OS schedules processes and threads using context switches; can be detrimental → threads may resume on different core, destroying locality

●We can change this by restricting threads to execute on a subset of processors, by setting processor affinity

● The simplest approach is to set the environment variable KMP_AFFINITY to:
    ● determine the machine topology,
    ● assign threads to processors

●Usage:

KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>] 

[email protected]
*For GNU, the ~ equivalent env var is GOMP_CPU_AFFINITY

Page 65: HPC Essentials

CPU Affinity Settings

●The modifier may take settings corresponding to granularity (with specifiers: fine, thread, and core), as well as a processor list (proclist={<proc­list>}), verbose, warnings and others

● The type settings refer to the nature of the affinity, and may take values:
    ● compact - try to assign the context of thread n+1 as close as possible to thread n
    ● disabled
    ● explicit - force assignment of threads to the processors in proclist
    ● none - just return the topology w/ the verbose modifier
    ● scatter - distribute as evenly as possible

●fine & thread refer to the same thing, namely that threads only resume in the same context; the core modifier implies that they may resume within a different context, but the same physical core

● CPU affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology...

[email protected]

Page 66: HPC Essentials

CPU Topology Map

●For any given computational node, we have several different physical devices (packages in sockets), comprised of cores (eg., two here), which run one or two thread contexts

●Without hyperthreading, there is only a single context per core ie., modifiers thread/fine, core are indistinguishable

[email protected]

[Diagram: a node with two packages (packageA, packageB), each containing core0 and core1, each core offering thread contexts 0 and 1]

Page 67: HPC Essentials

CPU Affinity Examples

●Display machine topology map eg,. Hammer :[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none[wjb19@hammer16 scratch] $ ./psktm.xOMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 infoOMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #156: KMP_AFFINITY: 12 available OS procsOMP: Info #157: KMP_AFFINITY: Uniform topologyOMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}

[email protected]

Page 68: HPC Essentials

CPU Affinity Examples●Set affinity with compact setting, fine granularity :[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact[wjb19@hammer5 scratch]$ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 infoOMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}OMP: Info #156: KMP_AFFINITY: 12 available OS procsOMP: Info #157: KMP_AFFINITY: Uniform topologyOMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10 OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10 OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2}OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10}OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6}OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1}OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9}OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}

[email protected]

Page 69: HPC Essentials

Conclusions

● Scientific research is supported by computational scaling and performance, both provided by parallelism, limited to some extent by Amdahl's law

●Parallelism has various levels of granularity; at the finest level is the instruction pipeline and vectorized registers eg., SSE

●The next level up in parallel granularity is the multiprocessor; we may run many concurrent threads using the pthreads API or the OpenMP standard for instance

●Threads must be coded and handled with care, to avoid race and deadlock conditions

●Performance is a strong function of cache utilization; benefits introduced through parallelization can easily be negated by sloppy use of memory bandwidth

●Scaling across cores is limited by hardware, Amdahl's law but also locality; we have some control over the latter using  KMP_AFFINITY for instance

[email protected]

Page 70: HPC Essentials

References

● Valgrind (buy the manual, worth every penny)
    ● http://valgrind.org/
● OpenMP
    ● http://openmp.org/wp/
● GNU OpenMP
    ● http://gcc.gnu.org/projects/gomp/
● Summary of OpenMP 3.0 C/C++ Syntax
    ● http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf
● Summary of OpenMP 3.0 Fortran Syntax
    ● http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf
● Nice SSE tutorial
    ● http://neilkemp.us/src/sse_tutorial/sse_tutorial.html
● Intel Nehalem
    ● http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29
● GNU Make
    ● http://www.gnu.org/s/make/
● Intel hyperthreading
    ● http://en.wikipedia.org/wiki/Hyper-threading

[email protected]

Page 71: HPC Essentials

Exercises

● Take the supplied code and parallelize using an OpenMP pragma around the worker function
● Create a makefile which builds the code, and compare timings between serial & parallel runs by varying OMP_NUM_THREADS
● Examine the effect of various settings for KMP_AFFINITY

[email protected]

Page 72: HPC Essentials

Build w/ Confidence : make

#Makefile for basic Kirchhoff Time Migration example

#set compiler
CC=icc -openmp

#set build options
CFLAGS=-std=c99 -c

#main executable
all: psktm.x

#objects and dependencies
psktm.x: psktmCPU.o demoA.o
	$(CC) psktmCPU.o demoA.o -o psktm.x

psktmCPU.o: psktmCPU.c
	$(CC) $(CFLAGS) psktmCPU.c

demoA.o: demoA.c
	$(CC) $(CFLAGS) demoA.c

clean:
	rm -rf *o psktm.x

[email protected]
Indent recipe lines with a tab only!

Page 73: HPC Essentials

HPC Essentials
Part III : Message Passing Interface

Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU

[email protected]

Page 74: HPC Essentials

Outline

● Motivation
● Interprocess Communication
    ● Signals
    ● Sockets & Networks
● procfs Digression
● Message Passing Interface
    ● Send/Receive
    ● Communication
    ● Parallel Constructs
    ● Grouping Data
    ● Communicators & Topologies

[email protected]

Page 75: HPC Essentials

Motivation

● We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where the parallel (P) and serial (1-P) code portions have fixed relative cost

● We looked at threads (“light-weight processes”) and also saw that performance depends on a variety of things, including good cache utilization and affinity

● For the problem size investigated, the limiting factor was ultimately disk I/O, so there was no sense going beyond a single compute node; on a machine with 16 or more cores there is no point when P < 60%, provided the process has sufficient memory

●However, as we increase our problem size, the relative parallel/serial cost changes and P can approach 1

[email protected]

Page 76: HPC Essentials

Motivation

● In the limit as the number of processors N → ∞ we find the maximum performance improvement:

1/(1-P)

● It is helpful to see the 3 dB point for this limit, ie., the number of processors N_1/2 required to achieve (1/√2)*max = 1/(√2*(1-P)); equating with Amdahl's law & after some algebra:

N_1/2 = P/((1-P)*(√2-1))

eg., P = 0.95 gives N_1/2 ≈ 46 processors

[email protected]

[Figure: N_1/2 vs parallel code fraction P, for P from 0.90 to 0.99 (N_1/2 rising from tens to a few hundred processors)]

Page 77: HPC Essentials

Motivation

● Points to note from the graph:
    ● P ~ 0.90 : we can benefit from ~ 20 cores
    ● P ~ 0.99 : we can benefit from a cluster size of ~ 256 cores
    ● P → 1 : we approach the “embarrassingly parallel” limit
    ● P ~ 1 : performance improvement directly proportional to cores
    ● P ~ 1 implies independent or batch processes

● Quite aside from considerations of Amdahl's law, as the problem size grows, we may simply exceed the memory available on a single node
    ● In this case, we must move to a distributed memory processing model/multiple nodes (unless P ~ 1 of course)

●How do we determine P? → PROFILING

[email protected]

Page 78: HPC Essentials

Profiling w/ Valgrind[wjb19@lionxf scratch]$ valgrind ­­tool=callgrind ./psktm.x[wjb19@lionxf scratch]$ callgrind_annotate ­­inclusive=yes callgrind.out.3853 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­Profile data file 'callgrind.out.3853' (creator: callgrind­3.5.0)­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­I1 cache: D1 cache: L2 cache: Timerange: Basic block 0 ­ 2628034011Trigger: Program terminationProfiled target:  ./psktm.x (PID 3853, part 1)

­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­20,043,133,545  PROGRAM TOTALS­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­            Ir  file:function­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­20,043,133,545  ???:0x0000003128400a70 [/lib64/ld­2.5.so]20,042,523,959  ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]20,042,522,144  ???:(below main) [/lib64/libc­2.5.so]20,042,473,687  /gpfs/scratch/wjb19/demoA.c:main20,042,473,687  demoA.c:main [/gpfs/scratch/wjb19/psktm.x]19,934,044,644  psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]19,934,044,644  /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU 6,359,083,826  ???:sqrtf [/gpfs/scratch/wjb19/psktm.x] 4,402,442,574  ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]   104,966,265  demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]

[email protected]

Parallelizable worker function is 99.5% of total instructions executed

If we wish to scale outside a single node, we must use some form of interprocess communication

Page 79: HPC Essentials

Inter-Process Communication

● There are a variety of ways for processes to exchange information, including:
    ● Memory (~last week)
    ● Files
    ● Pipes (named/anonymous)
    ● Signals
    ● Sockets
    ● Message Passing

●File I/O is too slow, and read/writes liable to race conditions

● Anonymous & named pipes are highly efficient but FIFO (first in, first out) buffers, allowing only unidirectional communication, and between processes on the same node

●Signals are a very limited form of communication, sent to the process after an interrupt by the kernel, and handled using a default handler or one specified using signal() system call

●Signals may come from a variety of sources eg., segmentation fault (SIGSEGV), keyboard interrupt Ctrl-C (SIGINT) etc
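● A minimal sketch of installing a handler with the signal() system call mentioned above, here catching SIGINT (Ctrl-C) instead of taking the default action:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

//volatile sig_atomic_t is the portable type for data shared with a handler
static volatile sig_atomic_t got_sigint = 0;

void handler(int sig){
    got_sigint = 1;        //just record the signal; do minimal work in handlers
}

int main(){
    //replace the default SIGINT action (terminate) with our handler
    if (signal(SIGINT, handler) == SIG_ERR){
        perror("signal");
        return EXIT_FAILURE;
    }

    printf("pid %d waiting, press Ctrl-C to interrupt...\n", getpid());
    while (!got_sigint)
        sleep(1);

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}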

[email protected]

Page 80: HPC Essentials

Signals

● strace is a powerful utility in UNIX which shows the interaction between a running process and the kernel, in the form of system calls and signals; here, partial output showing the mapping of signals to defaults with the system call sigaction(), from ./psktm.x:

rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0

●Signals are crude and restricted to local communication; to communicate remotely, we can establish a socket between processes, and communicate over the network

[email protected]

UNIX signals

Page 81: HPC Essentials

Sockets & Networks

●Davies and Baran first devised packet switching, an efficient means of communication over a channel; computers were built to realize the design and ARPANET went online in Oct 1969, with the first link between UCLA and the Stanford Research Institute (SRI)

●TCP/IP became the communication protocol of ARPANET on 1 Jan 1983; ARPANET was retired in 1990 in favor of NSFNET, which university networks in the US and Europe joined

●TCP/IP is just one of many protocols; a protocol describes the format of data packets and the nature of the communication. An analogous connection-oriented method is used by InfiniBand networks in conjunction with Remote Direct Memory Access (RDMA)

●The User Datagram Protocol (UDP) is a connectionless method of communication, analogous to the unreliable datagram transport used by InfiniBand high-performance networks

[email protected]

Page 82: HPC Essentials

Sockets : UDP host example

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */
#include <stdlib.h>

int main(void)
{
  //creates an endpoint & returns file descriptor
  //uses IPv4 domain, datagram type, UDP transport
  int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

  //socket address object (sa) and memory buffer
  struct sockaddr_in sa;
  char buffer[1024];
  ssize_t recsize;
  socklen_t fromlen;

  //specify same domain type, any input address and port 7654 to listen on
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = INADDR_ANY;
  sa.sin_port = htons(7654);
  fromlen = sizeof(sa);

     [email protected]

Page 83: HPC Essentials

Sockets : host example cont.

  //we bind an address (sa) to the socket using fd sock
  if (-1 == bind(sock, (struct sockaddr *)&sa, sizeof(sa)))
  {
    perror("error bind failed");
    close(sock);
    exit(EXIT_FAILURE);
  }

  for (;;)
  {
    //listen and dump buffer to stdout where applicable
    printf("recv test....\n");
    recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
    if (recsize < 0) {
      fprintf(stderr, "%s\n", strerror(errno));
      exit(EXIT_FAILURE);
    }
    printf("recsize: %zd\n", recsize);
    sleep(1);
    printf("datagram: %.*s\n", (int)recsize, buffer);
  }
}

[email protected]

Page 84: HPC Essentials

Sockets : client example

//includes as in the host example, plus <arpa/inet.h> for inet_addr()
int main(int argc, char *argv[])
{
  //create a buffer with character data
  int sock;
  struct sockaddr_in sa;
  int bytes_sent;
  char buffer[200];

  strcpy(buffer, "hello world!");

  //create a socket, same IP and transport as before, address of host 127.0.0.1
  sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (-1 == sock) /* if socket failed to initialize, exit */
  {
    printf("Error Creating Socket");
    exit(EXIT_FAILURE);
  }

  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = inet_addr("127.0.0.1");
  sa.sin_port = htons(7654);

  bytes_sent = sendto(sock, buffer, strlen(buffer), 0, (struct sockaddr *)&sa, sizeof sa);
  if (bytes_sent < 0) {
    printf("Error sending packet: %s\n", strerror(errno));
    exit(EXIT_FAILURE);
  }

  close(sock); /* close the socket */
  return 0;
}

●You can monitor sockets by using the netstat facility, which takes its data from /proc/net [email protected]

Page 85: HPC Essentials

Outline

●Motivation
●Interprocess Communication
    ● Signals
    ● Sockets & Networks
●procfs Digression
●Message Passing
    ● Send/Receive
    ● Communication
    ● Parallel Constructs
    ● Grouping Data
    ● Communicators & Topologies

[email protected]

Page 86: HPC Essentials

procfs

●We mentioned the /proc directory previously, in the context of CPU and memory information; it is frequently referred to as the proc filesystem or procfs

●It is a veritable treasure trove of information, written periodically by the kernel, and is used by a variety of tools eg., ps

●Each running process is assigned a directory, whose name is the process id

●Each directory contains text files and subdirectories with every detail of a running process, including context switching statistics, memory management, open file descriptors and much more

●Much like the ptrace() system call, procfs also gives user applications the ability to directly manipulate running processes, given sufficient permission; you can explore that on your own :)

[email protected]

Page 87: HPC Essentials

procfs : examples

●Some of the more useful files :

● /proc/PID/cmdline : command used to launch process
● /proc/PID/cwd : current working directory
● /proc/PID/environ : environment variables for the process
● /proc/PID/fd : directory w/ symbolic link for each open file descriptor eg., streams
● /proc/PID/status : information including signals, state, memory usage
● /proc/PID/maps : memory map between virtual and physical addresses

●eg., contents of the fd directory for running process ./psktm.x :

[wjb19@hammer1 fd]$ ls -lah
total 0
dr-x------ 2 wjb19 wjb19  0 Dec  7 12:13 .
dr-xr-xr-x 6 wjb19 wjb19  0 Dec  7 12:10 ..
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 0 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 1 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 2 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin
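A minimal sketch of reading procfs programmatically (paths are standard procfs entries; error handling is abbreviated): a program can open these kernel-maintained files like ordinary text, e.g. its own status entry, shown on the next slide:

#include <stdio.h>

/* dump the kernel-maintained status entry for the calling process */
int main(void)
{
    char line[256];
    FILE *fp = fopen("/proc/self/status", "r");   /* 'self' is a symlink to our own PID */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL)
        fputs(line, stdout);                      /* Name, State, Vm*, signal masks, ... */
    fclose(fp);
    return 0;
}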

[email protected]

Page 88: HPC Essentials

procfs : status file extract

[wjb19@hammer1 30769]$ more status
Name:      psktm.x
State:     R (running)
SleepAVG:  0%
Tgid:      30769
Pid:       30769
PPid:      30687
TracerPid: 0
Uid:       2511 2511 2511 2511
Gid:       2530 2530 2530 2530
FDSize:    256
Groups:    2472 2530 3835 4933 5505 5732
VmPeak:    65520 kB
VmSize:    65520 kB
VmLck:         0 kB
VmHWM:     37016 kB
VmRSS:     37016 kB
VmData:    51072 kB
VmStk:        88 kB
VmExe:        64 kB
VmLib:      2944 kB
VmPTE:       164 kB
StaBrk: 1289a000 kB
Brk:    128bb000 kB
StaStk: 7fffbd0a0300 kB
Threads: 5
SigQ:   0/398335
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000

[email protected]

The Vm* fields report virtual memory usage; the Sig* fields are signal masks (pending, blocked, ignored, caught)

Page 89: HPC Essentials

Outline

●Motivation
●Interprocess Communication
    ● Signals
    ● Sockets & Networks
●procfs Digression
●Message Passing Interface
    ● Send/Receive
    ● Communication
    ● Parallel Constructs
    ● Grouping Data
    ● Communicators & Topologies

[email protected]

Page 90: HPC Essentials

Message Passing Interface (MPI)

●Classical von Neumann machine has single instruction/data stream (SISD) → single process & memory

●Multiple Instruction, multiple data (MIMD) system → connected processes are asynchronous, generally distributed memory (may also be shared where processes on single node)

●MIMD Processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away

●MPI is a standard for parallel programming, developed by academics and industry beginning in 1991 (MPI-1 released in 1994) and updated occasionally since

●It comprises routines for point-to-point and collective communication, with bindings to C/C++ and fortran

●Depending on the underlying network fabric, communication may be TCP-like or UDP-like, e.g., over InfiniBand networks

[email protected]

Page 91: HPC Essentials

MPI : Basic communication

●Multiple, distributed processes are spawned at initialization, each process assigned a unique rank 0,1,...,p-1

●One may send information referencing process rank eg.,:

MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

●This function has a receive analogue; both routines are blocking by default

●Send/receive statements generally occur in the same code; processes execute the appropriate statement according to rank & code branch

●Non-blocking functions are available, allowing communicating processes to continue execution where able (see the sketch below)

(In the MPI_Send call above, &x is the buffer address, the fourth argument, 1, is the rank of the receiving process, and 0 is the message tag.)
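A minimal sketch of the non-blocking variants mentioned above (the two-rank exchange and tag value are illustrative); MPI_Isend/MPI_Irecv return immediately, and MPI_Wait enforces completion:

#include "mpi.h"

void exchange(float *sendbuf, float *recvbuf, int rank)
{
    /* assume exactly two ranks: 0 and 1 exchange one float each */
    int other = (rank == 0) ? 1 : 0;
    MPI_Request sreq, rreq;
    MPI_Status  status;

    MPI_Isend(sendbuf, 1, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recvbuf, 1, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &rreq);

    /* ... overlap useful computation here ... */

    MPI_Wait(&sreq, &status);   /* block until the send buffer may be reused   */
    MPI_Wait(&rreq, &status);   /* block until the received data has arrived   */
}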

[email protected]

Page 92: HPC Essentials

MPI : Requisite functions

●Bare minimum → initialize, get rank for process, total processes and finalize when done

MPI_Init(&argc, &argv);                  //Start up
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p);       //No. processors
MPI_Finalize();                          //close up shop

●MPI_COMM_WORLD is a communicator parameter, a collection of processes that can send messages to each other.

●Messages are sent with tags to identify them, allowing specificity beyond using just a source/destination parameter

[email protected]

Page 93: HPC Essentials

MPI : Datatypes

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float 

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

[email protected]

Page 94: HPC Essentials

Minimal MPI example

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
        int rank, size;
        int buffer[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank > 0)
        {
                for (int i = 0; i < 10; i++)
                        buffer[i] = i * rank;

                MPI_Send(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
                for (int i = 1; i < size; i++){
                        MPI_Recv(buffer, 10, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                        printf("buffer element 0 : %i from proc : %i \n", buffer[0], i);
                }
        }
        MPI_Finalize();
        return 0;
}

[email protected]

Page 95: HPC Essentials

MPI : Collective Communication

● A communication pattern involving all processes in a communicator is a collective communication eg., a broadcast

● Same data sent to every process in communicator, more efficient than using multiple p2p routines, optimized :

MPI_Bcast(void* message, int count, MPI_Datatype type, int root, MPI_Comm comm)

● Sends a copy of the data in message from the root process to every process in comm; unlike a scatter, every process receives the same data (see the sketch below)

● Collective communication is at the heart of efficient parallel operations
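A minimal broadcast sketch (the variable name and value are illustrative): rank 0 sets a parameter and every process in the communicator receives the same value:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    float dt = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        dt = 0.004f;                 /* e.g., a time step read from an input file */

    /* every process in MPI_COMM_WORLD ends up with rank 0's value of dt */
    MPI_Bcast(&dt, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    printf("rank %d has dt = %f\n", rank, dt);
    MPI_Finalize();
    return 0;
}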

[email protected]

Page 96: HPC Essentials

Parallel Operations : Reduction

● Data may be gathered/reduced after computation via :

MPI_Reduce(void* operand, void* result, int count, MPI_Datatype type, MPI_Op operator, int root, MPI_Comm comm)

● Combines the operand values from all processes in comm using operator, and stores the result in result on process root

● A tree-structured reduce with the result left on every process == MPI_Allreduce, i.e., every process in comm gets a copy of the result (see the sketch below)
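A minimal reduction sketch (the per-rank partial value is illustrative): each process contributes a local value and the root obtains the global sum:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    float local, global = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (float)rank;             /* stand-in for a locally computed partial result */

    /* sum the local values; only rank 0 receives the result in 'global' */
    MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    /* MPI_Allreduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD)
       would leave the result on every rank instead                          */

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}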

[email protected]

(diagram: processes 1, 2, 3, ..., p-1 reduce to process 0, the root)

Page 97: HPC Essentials

Reduction Ops

MPI_MAX

MPI_MIN

MPI_SUM

MPI_PROD

MPI_LAND Logical and

MPI_BAND Bitwise and

MPI_LOR Logical or

MPI_BOR Bitwise or

MPI_LXOR Logical XOR

MPI_BXOR Bitwise XOR

MPI_MAXLOC Max w/ location

MPI_MINLOC Min w/ location


[email protected]

Page 98: HPC Essentials

Parallel Operations : Scatter/Gather

● Bulk transfers of many-to-one and one-to-many are accomplished by gather and scatter operations respectively

● These operations form the kernel of matrix/vector operations for example; they are useful for distributing and reassembling arrays

(diagram: a gather collects elements x0, x1, x2, x3 from processes 0-3 into a single array a00, a01, a02, a03 on one process; a scatter is the reverse operation)

[email protected]

Page 99: HPC Essentials

Scatter/Gather Syntax

● MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

● Collects data referenced by send_data from each process in comm and stores data in process rank order on process w/ rank root, in memory referenced by recv_data

● MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

● Splits the data referenced by send_data on the process w/ rank root into segments of send_count elements each, with type send_type, and distributes them in rank order to the processes in comm

● For gather result to ALL processes → MPI_Allgather (see the sketch below)
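A minimal scatter/gather sketch, assuming the global array length divides evenly by the number of processes (all names and the per-process work are illustrative):

#include "mpi.h"
#include <stdlib.h>

#define N_PER_PROC 4

int main(int argc, char *argv[])
{
    int rank, size;
    float *full = NULL;                 /* only meaningful on the root */
    float  local[N_PER_PROC];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        full = malloc(size * N_PER_PROC * sizeof(float));
        for (int i = 0; i < size * N_PER_PROC; i++)
            full[i] = (float)i;
    }

    /* root splits 'full' into size pieces of N_PER_PROC floats, one per rank */
    MPI_Scatter(full, N_PER_PROC, MPI_FLOAT,
                local, N_PER_PROC, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < N_PER_PROC; i++)
        local[i] *= 2.0f;               /* stand-in for local work */

    /* root reassembles the pieces in rank order back into 'full' */
    MPI_Gather(local, N_PER_PROC, MPI_FLOAT,
               full,  N_PER_PROC, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        free(full);
    MPI_Finalize();
    return 0;
}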

[email protected]

Page 100: HPC Essentials

Grouping Data I

● Communication is expensive → bundle variables into a single message
● We must define a derived type that can describe the heterogeneous contents of a message, using type and displacement pairs
● Several ways to build this MPI_Datatype eg.,

MPI_Type_struct(int count,
    int block_lengths[],        //contains no. entries in each block
    MPI_Aint displacements[],   //element offset from msg start
    MPI_Datatype typelist[],    //exactly that
    MPI_Datatype* new_mpi_t     //a pointer to this new type
)

● A very general derived type, although the arrays describing the struct (block lengths, displacements, types) must be constructed explicitly using other MPI calls

● Simpler routines exist for less heterogeneous data eg., MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed

MPI_Aint allows for addresses larger than an int (see the sketch below for building such a type)
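A minimal sketch of building a derived type for a two-member struct (the struct itself is illustrative), using MPI_Get_address and MPI_Type_create_struct, the MPI-2 successors of MPI_Address and MPI_Type_struct:

#include "mpi.h"

/* illustrative message bundling one int and one double */
struct params { int n; double tol; };

MPI_Datatype build_params_type(struct params *p)
{
    int          lens[2]  = {1, 1};
    MPI_Aint     disps[2], base;
    MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
    MPI_Datatype newtype;

    /* displacements of each member relative to the start of the struct */
    MPI_Get_address(p,       &base);
    MPI_Get_address(&p->n,   &disps[0]);
    MPI_Get_address(&p->tol, &disps[1]);
    disps[0] -= base;
    disps[1] -= base;

    MPI_Type_create_struct(2, lens, disps, types, &newtype);
    MPI_Type_commit(&newtype);      /* must commit before use (next slide) */
    return newtype;
}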

[email protected]

Page 101: HPC Essentials

Grouping Data II

● Before these derived types can be used by a communication function, they must be committed with an MPI_Type_commit function call

● In order for message to be received, type signatures at send and receive must be compatible; if a collective communication, signatures must be identical

● MPI_Pack & MPI_Unpack are useful when messages of heterogeneous data are infrequent, and the cost of constructing a derived type outweighs the benefit (see the sketch after this list)

● These methods also allow buffering in user versus system memory, and the number of items transmitted is in the message itself

● Grouped data allows for sophisticated objects; we can also create more fine-grained communication objects
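A minimal pack/unpack sketch (buffer size, variables and tag are illustrative): the sender packs an int and a float into one buffer, and the receiver unpacks them in the same order:

#include "mpi.h"

#define BUF_SIZE 100

/* sender side: bundle an int and a float into one MPI_PACKED message */
void send_packed(int n, float x, int dest)
{
    char buffer[BUF_SIZE];
    int  position = 0;

    MPI_Pack(&n, 1, MPI_INT,   buffer, BUF_SIZE, &position, MPI_COMM_WORLD);
    MPI_Pack(&x, 1, MPI_FLOAT, buffer, BUF_SIZE, &position, MPI_COMM_WORLD);
    MPI_Send(buffer, position, MPI_PACKED, dest, 0, MPI_COMM_WORLD);
}

/* receiver side: unpack in the same order the data was packed */
void recv_packed(int *n, float *x, int src)
{
    char buffer[BUF_SIZE];
    int  position = 0;
    MPI_Status status;

    MPI_Recv(buffer, BUF_SIZE, MPI_PACKED, src, 0, MPI_COMM_WORLD, &status);
    MPI_Unpack(buffer, BUF_SIZE, &position, n, 1, MPI_INT,   MPI_COMM_WORLD);
    MPI_Unpack(buffer, BUF_SIZE, &position, x, 1, MPI_FLOAT, MPI_COMM_WORLD);
}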

[email protected]

Page 102: HPC Essentials

Communicators

● Process subsets or groups expand communication beyond simple p2p and broadcast communication, to create :

● Intra-communicators → communicate among one another and participate in collective communication, composed of :

– an ordered collection of processes (group)
– a context

● Inter-communicators → communicate between different groups

● Communicators/groups are opaque, internals not directly accessible; these objects are referenced by a handle

[email protected]

Page 103: HPC Essentials

Communicators Cont.

● Internal contents manipulated by methods, much like private data in C++ class objects eg.,

● int MPI_Group_incl(MPI_Group old_group,int new_group_size, int ranks_in_old_group[], MPI_Group* new_group) → create a new_group from old_group, using ranks_in_old_group[] etc

● int MPI_Comm_create(MPI_Comm old_comm, MPI_Group new_group, MPI_Comm* new_comm) → create a new communicator from the old, with context

● MPI_Comm_group and MPI_Group_incl are local methods without communication; MPI_Comm_create is a collective communication implying synchronization, i.e., to establish a single context (see the sketch below)

● Multiple communicators may be created simultaneously using MPI_Comm_split
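A minimal sketch tying these calls together (the choice of the even ranks is illustrative): build a communicator over a subset of MPI_COMM_WORLD:

#include "mpi.h"
#include <stdlib.h>

/* build a communicator containing only the even ranks of MPI_COMM_WORLD */
MPI_Comm make_even_comm(void)
{
    int size, n_even;
    MPI_Group world_group, even_group;
    MPI_Comm  even_comm;

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    n_even = (size + 1) / 2;

    int *ranks = malloc(n_even * sizeof(int));
    for (int i = 0; i < n_even; i++)
        ranks[i] = 2 * i;                             /* 0, 2, 4, ... */

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);     /* local, no communication */
    MPI_Group_incl(world_group, n_even, ranks, &even_group);
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);  /* collective call */

    /* processes not in even_group receive MPI_COMM_NULL */
    free(ranks);
    return even_comm;
}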

[email protected]

Page 104: HPC Essentials

Topologies I

● MPI allows one to associate different addressing schemes to processes within a group
● This is a virtual versus real or physical topology, and is either a graph structure or a (Cartesian) grid; properties:
    – Dimensions, w/
        – Size of each
        – Period of each
● Option to have processes reordered optimally within grid
● Method to establish Cartesian grid cart_comm :

int MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)

● old_comm is typically just MPI_COMM_WORLD created at init (see the sketch below)
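A minimal sketch of the call above (the 4x4 periodic grid is illustrative; the dimensions should multiply to the number of processes in old_comm):

#include "mpi.h"

/* build a 4x4 periodic Cartesian grid over MPI_COMM_WORLD (16 processes assumed) */
MPI_Comm make_grid(void)
{
    MPI_Comm cart_comm;
    int dim_sizes[2]   = {4, 4};   /* size of each dimension        */
    int wrap_around[2] = {1, 1};   /* both dimensions periodic      */
    int reorder        = 1;        /* allow MPI to reorder ranks    */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around,
                    reorder, &cart_comm);
    return cart_comm;
}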

[email protected]

Page 105: HPC Essentials

Topologies II

● cart_comm will contain the processes from old_comm with associated coordinates, available from MPI_Cart_coords:

int coordinates[2];
int my_grid_rank;
MPI_Comm_rank(cart_comm, &my_grid_rank);
MPI_Cart_coords(cart_comm, my_grid_rank, 2, coordinates);

● Call to MPI_Comm_rank is necessary because of process rank reordering (optimization)

● Processes in cart_comm are stored in row major order
● Can also partition into sub-grid(s) using MPI_Cart_sub eg., for a row:

int free_coords[2];
MPI_Comm row_comm;    //new sub-grid
free_coords[0] = 0;   //bool; first coordinate fixed
free_coords[1] = 1;   //bool; second coordinate free
MPI_Cart_sub(cart_comm, free_coords, &row_comm);

[email protected]

Page 106: HPC Essentials

Writing Parallel Code

● Assuming we've profiled our code and decided to parallelize, equipped with MPI routines, we must decide whether to take a :
    ● Domain parallel (divide tasks, similar data) or
    ● Data parallel (divide data, similar tasks) approach

● Data parallel in general scales much better and implies lower communication overhead

● Regardless, easiest to begin by selecting or designing data structures, and subsequently their distribution using a constructed topology or scatter/gather routines, for example

● Program in modules, beginning with easiest/essential functions (eg., I/O), relegating 'hard' functionality to stubs initially

● Time code sections, look at targets for optimization & redesign

● Only concern yourself with the highest levels of abstraction germane to your problem, use parallel constructs wherever possible

[email protected]

Page 107: HPC Essentials

A Note on the OSI Model

●We've been playing fast and loose with a variety of communication entities: sockets, networks, protocols like UDP, TCP etc
●The Open Systems Interconnection model separates these entities into 7 layers of abstraction, each layer providing services to the layer immediately above
●Data becomes increasingly fine grained going down from layer 7 to 1

●As application developers and/or scientists, we need only be concerned with layers 4 and above

[email protected]

Layer Granularity Function Example

7.Application data process accessing network MPI

6.Presentation data encrypt/decrypt, data conversion MPI

5.Session data management MPI

4.Transport segments reliability & flow control IB verbs

3.Network packets path Infiniband

2.Data Link frames addressing Infiniband

1.Physical bits signals/electrical Infiniband

Page 108: HPC Essentials

Conclusions

●We can determine the parallel portion of our code through profiling; as a rule of thumb, a code with P ~ 99% can effectively utilize about 256 cores, a code with P ~ 90% about 20 cores

●When the parallel portion of code approaches 90%, we can justify going outside the multi-core node and using some form of inter-process communication (IPC)

●IPC comes in a variety of forms eg., sockets connected over networks, signals between processes on a single machine

●The message passing interface (MPI) abstracts away details of the IPC used over networks, providing language bindings to C, Fortran etc

●MPI has a number of highly optimized collective communication and parallel constructs, sophisticated means of grouping objects, as well as computational topologies

●The OSI Model assigns various communication entities to one of seven layers; we need only be concerned with layer four and above

[email protected]

Page 109: HPC Essentials

References

●Pacheco's excellent MPI text
    http://www.cs.usfca.edu/~peter/ppmpi/
●Valgrind (no really, buy the manual)
    http://valgrind.org/
●UNIX signals
    http://www.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm
●OpenMPI
    http://open-mpi.org/
●procfs
    http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html
●Excellent article on ptrace
    http://linuxgazette.net/81/sandeep.html
●Kernel vulnerabilities associated with ptrace/procfs
    http://www.kb.cert.org/vuls/id/176888
●MPI tutorials
    http://www.mcs.anl.gov/research/projects/mpi/learning.html
●Linux Gazette articles
    http://linuxgazette.net
●Open Systems Interconnection
    http://en.wikipedia.org/wiki/OSI_Reference_Model
●PBS reference
    http://rcc.its.psu.edu/user_guides/system_utilities/pbs/

[email protected]

Page 110: HPC Essentials

Exercises

●Build the supplied MPI code via 'make -f Makefile_' and submit to cluster of your choice using the following PBS script
●Compare scaling with OpenMP example from last week, by varying both nodes and procs per node (ppn); differences? (NUMA vs good locality w/ MPI)

●Sketch how the gather function is collecting data, and the root process subsequently writes out to disk
●Similarly, sketch how the image data exists in memory; are the two pictures commensurate? (hint: no :) )
●Re-assign image grid tiles to processes such that no file manipulation is required after program completion

[email protected]

Page 111: HPC Essentials

Scheduling on clusters : PBS

●Basic submission script foo.pbs:

#PBS -l nodes=4:ppn=4
#PBS -l mem=1Gb
#PBS -l walltime=00:20:00
#PBS -o /gpfs/home/wjb19/scratch/ktm_stdout.txt
#PBS -e /gpfs/home/wjb19/scratch/ktm_stderr.txt
#PBS -V

cd /gpfs/home/wjb19/scratch
module load openmpi/gnu/1.4.2

mpirun ./psktm.x

●Submit to cluster :
[wjb19@lionxf scratch]$ qsub foo.pbs

●Check status:
[wjb19@lionxf scratch]$ qstat -u wjb19

●List nodes for running jobs :
[wjb19@lionxf scratch]$ qstat -n

[email protected]

Page 112: HPC Essentials

Debug MPI applications

● MPI programs are readily debugged using serial tools like gdb, once you have :

    ● compiled with -g & submitted your job

    ● an assigned node from a qstat query

    ● the process id on that node, ie., ssh to the node and

    ● attach gdb to that process (eg., gdb -p <pid>)

●  OpenMPI.org gives a useful code block to use in conjunction with this technique :

{
    int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", getpid(), hostname);
    fflush(stdout);
    while (0 == i)
        sleep(5);
}

● Once attached and working with gdb, you can set some breakpoints and alter the parameter i (eg., set var i=7) to move out of the loop

[email protected]