Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Programming with StarSs

Rosa M. BadiaComputer Sciences Research Dept.

BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 2

Agenda

• StarSs overview• SMPSs • SMPSs examples• Single node hands-on• Hybrid model MPI/SMPSs• Programming examples• MPI/SMPSs hands-on

• Slides available in:• http://marsa.ac.upc.edu/prace/course_material_starss• /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial_PRACE_starss.ppt


StarSs overview


StarSs

• StarSs

– A “node” level programming model– C/Fortran + directives– Nicely integrates in hybrid MPI/StarSs– Natural support for heterogeneity

• Programmability– Incremental parallelization/restructure– Abstract/separate algorithmic issues from

resources– Disciplined programming

• Portability– “Same” source code runs on “any” machine

• Optimized task implementations will result in better performance.

– “Single source” for maintained version of a application

• Performance– Asynchronous (data-flow) execution and locality

awareness– Intelligent Runtime: specific for each type of

target platform.• Automatically extracts and exploits parallelism• Matches computations to resources

StarSsCellSs

SMPSs

GPUSs

GridSs

ClearSpeedSsClusterSs

CompSs (Java)


History / Strategy

SMPSs V2~2009

NANOS++~2008

GPUSs~2009

CellSs~2006

SMPSs V1~2007

PERMAS~1994

GridSs~2002

CompSs~2007

NANOS~1996


StarSs: a sequential program …

void vadd3 (float A[BS], float B[BS], float C[BS]); void scale_add (float sum, float A[BS], float B[BS]); void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS) // C=A+B vadd3 ( &A[i], &B[i], &C[i]);...for (i=0; i<N; i+=BS) // sum(C[i]) accum (&C[i], &sum);...for (i=0; i<N; i+=BS) // B=sum*A scale_add (sum, &E[i], &B[i]);...for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]);...for (i=0; i<N; i+=BS) // E=G+F vadd3 (&G[i], &F[i], &E[i]);


StarSs: … taskified …

#pragma css task input(A, B) output(C)void vadd3 (float A[BS], float B[BS], float C[BS]);#pragma css task input(sum, A) inout(B)void scale_add (float sum, float A[BS], float B[BS]);#pragma css task input(A) inout(sum)void accum (float A[BS], float *sum);


1 2 3 4

13 14 15 16

5 6 87

17

9

18

10

19

11

20

12

Color/number: order of task instantiationSome antidependences covered by flow dependences not drawn

Compute dependences @ task instantiation time


WriteDecouplehow we write

formhow it is executed

StarSs: … and executed in a data-flow model

#pragma css task input(A, B) output(C)void vadd3 (float A[BS], float B[BS], float C[BS]);#pragma css task input(sum, A) inout(B)void scale_add (float sum, float A[BS], float B[BS]);#pragma css task input(A) inout(sum)void accum (float A[BS], float *sum);

1 1 1 2

2 2 2 3

2 3 54

7

6

8

6

7

6

8

7


Execute

Color/number: a possible order of task execution


StarSs: the potential of data access information

• Flat global address space seen by programmer

• Flexibility to dynamically traverse dataflow graph “optimizing”

• Concurrency. Critical path

• Memory access: data transfers performed by run time

• Opportunities for

• Prefetch

• Reuse

• Eliminate antidependences (rename)

• Replication management

• Coherency/consistency handled by the runtime


StarSs: … reductions

#pragma css task input(A, B) output(C)void vadd3 (float A[BS], float B[BS], float C[BS]);#pragma css task input(sum, A) inout(B)void scale_add (float sum, float A[BS], float B[BS]);#pragma css task input(A) inout(sum) reduction(sum)void accum (float A[BS], float *sum);

1 1 1 2

2 2 2 3

2 3 3

5

4

6

4

5

4

6

5


Color/number: possible order of task execution

2


StarSs: just a few directives

#pragma css task [input ( parameters ) ] \

[output ( parameters ) ] \

[inout ( parameters )] \

[target device( [cell, smp, cuda] ) ] \

[implements ( task_name ) ] \

[reduction ( parameters ) ] \

[ highpriority ]

#pragma css wait on ( data_address )

#pragma css barrier

#pragma css mutex lock ( variable )

#pragma css mutex unlock( variable )

parameters: parameter [ , parameter ]*parameter: variable_name {[dimension]}*


StarSs Syntax

• Examples: task selection

#pragma css task input(A, B) inout(C)void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])void block_addmultiply( float *C, float *A, float *B ) { ..

#pragma css task input(A[size][size], B[size][size], size) inout(C[size][size])void block_addmultiply( float *C, float *A, float *B, int size ) { ..

Note: task annotations can be before function code or before function declaration, in a header file.

• Examples: waiting for data

#pragma css task input (ref_block, to_comp) output (mse)void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ... ...are_blocks_equal (X[ii][jj],Y[ii][jj], &sq_error);#pragma css wait on (sq_error)

if (sq_error >0.0000001){ ...


StarSs Syntax

• Examples: reductions

#pragma css task input (vec, n) inout(results) reduction (results)void sum_task(int *vec, int n, int *results){ int i; int local_sol=0;

for (i = 0; i < n; i++) local_sol += vec[i];#pragma css mutex lock (results) *results = *results + local_sol;#pragma css mutex unlock (results)}

• Examples: reductions. The parameter used in the mutex can be anything that identifies the access uniquely

#pragma css task input (vec, n, id) inout(results) reduction (results)void sum_task(int *vec, int n, int *results, int id){ int i; int local_sol=0;

for (i = 0; i < n; i++) local_sol += vec[i];#pragma css mutex lock (id) *results = *results + local_sol;#pragma css mutex unlock (id)}


StarSs Syntax

• MPI tasks#pragma css task input(partner, bufsend) output(bufrecv) target(comm_thread)void communication(int partner, double bufsend[BLOCK_SIZE], double bufrecv[BLOCK_SIZE]){ int ierr; MPI_Request request[2]; MPI_Status status[2];

ierr = MPI_Isend(bufsend, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[0]);

ierr = MPI_Irecv(bufrecv, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[1]);

MPI_Waitall(2, request, status);

}


StarSs Syntax in Fortran

subroutine example() ... interface !$CSS TASK subroutine block_add_multiply(C, A, B, BS) implicit none integer, intent (in) :: BS real, intent (in) :: A(BS,BS), B(BS,BS) real, intent (inout) :: C(BS,BS) end subroutine end interface ... !$CSS START ... call block_add_multiply(C, A, B, BLOCK_SIZE) ... !$CSS FINISH...end subroutine!$CSS TASKsubroutine block_add_multiply(C, A, B, BS)...end subroutine


StarSs Syntax in Fortran: with MPI

subroutine example() ... interface

!$css task TARGET(COMM_THREAD) subroutine checksum(i, u1, d1, d2, d3) implicit none integer, intent(in) :: i, d1, d2, d3 double complex, intent(in) :: u1(d1*d2*d3)

end subroutine

end interface ...


SMPSs


Slave threads

FUFUFU

SMPSs implementation

IFUREG

ISSIQRENDEC

RETMain thread

CPU0

User mainprogram

SMPSs runtime library

Main thread

Memory

Data dependenceData renaming

Renaming table

...

SchedulingTask execution

GlobalReady task queues

High pri

Low pri

Thread 0Ready task queue

Original task code

SMPSs runtime library

SchedulingTask executionOriginal task code

Worker thread 1


CPU1

Work stealing

SMPSs ru

Original task code

Worker thread 2

CPU2


Work stealing

J.M. Perez, et al, “A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures” Cluster 2008


SMPss: Compiler phase

Code translation

(mcc)

smpss-cc_app.c pack

C compiler(gcc, icc, ...)

app.tasks (tasks list)

app.c

smpss-cc_app.o

app.o

SMPSS-CC


smpss-cc-app.c

SMPss: Linker phase

app.c

unpack

smpss-cc-app.c

app-adapters.c

execlibSMPSS.so

Linker

glue code generator

app.capp.o

app.tasks

exec-adapters.c

app-adapters.ccsmpss-cc_app.o

C compiler(gcc, icc,...)

exec-registration.c

exec-adapters.oexec-registration.o

SMPSS-CC


SMPSs Programming examples


Programing examples

Kernel Explanation Specific features

Simple example initial simple codes Basic StarSs features

MxM matrix multiply Basic features; CellSs, GPUSs implementations

Cholesky cholesky factorization Irregular graph

N Queens Solution of N Queens Recursion. Memory management. Granularity

Stream memory bandwidth stress Use of barrier

Argon simple MD simulation Example in fortran


Simple examples

#pragma css task input(A, B) inout(C)void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])void dgemm( float *C, float *A, float *B ) { ...}

#pragma css task input(A, B) inout(C)void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}main() {( ...#pragma css barriert = time();dgemm(A,B,C);dgemm(D,E,F);dgemm(C,F,G);Dgemm(C,E,H);

#pragma css barriert=time() – t;prinft (“result G = %f; time = %d\n”, G[0][0]);}

1 2

34

Where information on size and address comes from

Barriers for timing purposes


Simple examples

#pragma css task input(A, B) inout(C)void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}main() {( ...dgemm(A,B,C);dgemm(D,E,F);dgemm(C,F,G);dgemm(A,D,H);dgemm(C,H,I);

#pragma css waiton (F)prinft (“result F = %f\n”, F[0][0]);

dgemm(H,G,C);

#pragma css barrierprinft (“result C = %f\n”, C[0][0]);}

1 2

35

6

4


MxM on matrix stored by blocks

int main (int argc, char **argv) {int i, j, k;…

initialize(A, B, C);

for (i=0; i < NB; i++) for (j=0; j < NB; j++) for (k=0; k < NB; k++) mm_tile( C[i][j], A[i][k], B[k][j]);}

#pragma css task input(A, B) inout(C)static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {int i, j, k;

for (i=0; i < BS; i++) for (j=0; j < BS; j++) for (k=0; k < BS; k++) C[i][j] += A[i][k] * B[k][j];}

Will work on matrices of any size

Will work on any number of cores/devices

BS

BSNB

NB

BS

BS


MxM @ CellSs

BS

BSNB

NB

BS

BS



for (i=0; i < NB; i++) for (j=0; j < NB; j++) for (k=0; k < NB; k++) dgem_64x64( C[i][j], A[i][k], B[k][j]);}

#define StageCBA(OFFSET,OFFB) \{ \ ALIGN8B; \ SPU_FMA(c0_0B,a00,b0_0B,c0_0B); c0_0C = *((volatile vector float *)(ptrC+OFFSET+16)); \ SPU_FMA(c1_0B,a10,b0_0B,c1_0B); c1_0C = *((volatile vector float *)(ptrC+M+OFFSET+16)); \ SPU_FMA(c2_0B,a20,b0_0B,c2_0B); c2_0C = *((volatile vector float *)(ptrC+2*M+OFFSET+16)); \ SPU_FMA(c3_0B,a30,b0_0B,c3_0B); SPU_LNOP; \ SPU_FMA(c0_1B,a00,b0_1B,c0_1B); c3_0C = *((volatile vector float *)(ptrC+3*M+OFFSET+16)); \ SPU_FMA(c1_1B,a10,b0_1B,c1_1B); c0_1C = *((volatile vector float *)(ptrC+OFFSET+20)); \ SPU_FMA(c2_1B,a20,b0_1B,c2_1B); c1_1C = *((volatile vector float *)(ptrC+M+OFFSET+20)); \ SPU_FMA(c3_1B,a30,b0_1B,c3_1B); SPU_LNOP; \ SPU_FMA(c0_0B,a01,b1_0B,c0_0B); c2_1C = *((volatile vector float *)(ptrC+2*M+OFFSET+20)); \…….

static void dgem_64x64(volatile float *blkC, volatile float *blkA, volatile float *blkB){ unsigned int i; volatile float *ptrA, *ptrB, *ptrC;

vector float a0, a1, a2, a3;

vector float a00, a01, a02, a03; vector float a10, a11, a12, a13;…..for(i=0; i<M; i+=4){ ptrA = &blkA[i*M]; ptrB = &blkB[0]; ptrC = &blkC[i*M]; a0 = *((volatile vector float *)(ptrA)); a1 = *((volatile vector float *)(ptrA+M)); a2 = *((volatile vector float *)(ptrA+2*M)); a3 = *((volatile vector float *)(ptrA+3*M)); a00 = spu_shuffle(a0, a0, pat0); a01 = spu_shuffle(a0, a0, pat1); a02 = spu_shuffle(a0, a0, pat2); a03 = spu_shuffle(a0, a0, pat3); a10 = spu_shuffle(a1, a1, pat0); a11 = spu_shuffle(a1, a1, pat1);…… a33 = spu_shuffle(a3, a3, pat3); Loads4RegSetA(0); Ops4RegSetA(); Loads4RegSetB(8); StageCBA(0,0); StageACB(8,0); StageBAC(16,0); StageCBA(24,0); StageACB(32,0); StageBAC(40,0); StageMISC(0,0); StageCBAmod(0,4); StageACB(8,4); StageBAC(16,4); StageCBA(24,4); StageACB(32,4); StageBAC(40,4); StageMISC(4,4); StageCBAmod(0,8); StageACB(8,8);…..

#ifdef SPU_CODE#include <blas_s.h> #endif

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])void dgemm_64x64(double *C, double *A, double *B);

Leverage existing kernels, libraries,…


MxM @ GPUSs using CUBLAS kernel



for (i=0; i < NB; i++) for (j=0; j < NB; j++) for (k=0; k < NB; k++) mm_tile( C[i][j], A[i][k], B[k][j], BS); }

#pragma css task input(A[NB][NB], B[NB][NB], NB) \ inout(C[NB][NB]) target device(cuda)void mm_tile (float *A, float *B, float *C, int NB){ unsigned char TR = 'T', NT = 'N'; float DONE = 1.0, DMONE = -1.0; float *d_A, *d_B, *d_C;

cublasSgemm (NT, NT, NB, NB, NB, DMONE, A, NB, B, NB, DONE, C, NB);}

BS

BSNB

NB

BS

BS


Cholesky factorization

• Cholesky factorization

• Common matrix operation used to solve normal equations in linear least squares problems.

• Calculates a triangular matrix (L) from a symetric and positive defined matrix A.

Cholesky(A) = L

L · Lt = A

• Different possible implementations, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)

• It can be decomposed in block operations



• In each iteration red and blue blocks are updated

• SPOTRF: Computes the Cholesky factorization of the diagonal block .

• STRSM: Computes the column panel

• SSYRK: Computes the row panel

• SGEMM: Updates the rest of the matrix

block_syrk



main (){... for (int j = 0; j < DIM; j++){ for (int k= 0; k< j; k++){ for (int i = j+1; i < DIM; i++){ // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t css_sgemm_tile( A[i][k], A[j][k], A[i][j] ); } } for (int i = 0; i < j; i++){ // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t css_ssyrk_tile(A[j][i],A[j][j]); }

// Cholesky Factorization of A[j,j] css_spotrf_tile( A[j][j] ); for (int i = j+1; i < DIM; i++){ // A[i,j] <- A[i,j] = X * (A[j,j])^t css_strsm_tile( A[j][j], A[i][j] ); } }

... for (int i = 0; i < DIM; i++) { for (int j = 0; j < DIM; j++) {#pragma css wait on (A[i][j]) print_block(A[i][j]); } }... }

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])void spotrf_tile(float *A)

DIM

DIM

64

64




N Queens

• Compute the number of solutions to the problem of locating N queens on an N N board, without any of them killing any other

N 4 5 6 7 8 9 10 ... 14

Solutions 2 10 4 40 92 352 724 ... 365596

2 4 6 8 3 1 7 5 … …

j


void nqueens(int n, int j, char *a, int *results){

int i;

if (n == j) {/* put good solution in heap, return pointer to it. */if (!find_solution){

find_solution = 1;memcpy(sol, a, n * sizeof(char));

}(*results)++;

} else {/* try each possible position for queen <j> */for (i = 0; i < n; i++) {

a[j] = i;if (ok(j + 1, a)) {

nqueens(n, j + 1, a, results);}

}}

}

...nqueens (n, 0, a, &total_res);...

N Queens: sequential version

int ok(int n, char *a){

int i, j;char p, q;

for (i = 0; i < n; i++) {p = a[i];for (j = i + 1; j < n; j++) {

q = a[j];if (q == p || q == p - (j - i) || q == p + (j - i))

return 0;}

}return 1;

}


Queens strategy in SMPSs

Solution

CUT level

Task

Sequential execution


#pragma css task input (n, j, a[n]) inout (results)\ reduction (results)void nqueens_ser_task(int n, int j, char *a, int *results){

int i;int local_sols=0;

if (n == j) {#pragma css mutex lock (&find_solution)

if (!find_solution){find_solution = 1;memcpy(sol, a, n * sizeof(char));

}#pragma css mutex unlock (&find_solution)

local_sols++;} else { for (i = 0; i < n; i++) {

a[j] = i;if (ok(j + 1, a)) { nqueens_ser_task(n, j + 1, a,

&local_sols);}

}}

#pragma css mutex lock (results)*results = *results + local_sols;

#pragma css mutex unlock (results)}

N Queens

• Manual memory allocation

void nqueens(int n, int j, char *a, int depth) { int i;

for (i = 0; i < n; i++) {a[j] = i;if (ok(j + 1, a)) { if (depth < task_depth) {

nqueens(n, j + 1, a, b, depth + 1); } else {

b= malloc (BOARD_SIZE*SIZEOF(int));mcopy (b, a, j*SIZEOF(int));nqueens_ser_task(n, j + 1, b,

&total_res);

}}

}}


#pragma css task input (n, j, a[n]) inout (results)\ reduction (results)void nqueens_ser_task(int n, int j, char *a, int *results){

int i;int local_sols=0;

if (n == j) {#pragma css mutex lock (&find_solution)

if (!find_solution){find_solution = 1;memcpy(sol, a, n * sizeof(char));

}#pragma css mutex unlock (&find_solution)

local_sols++;} else { for (i = 0; i < n; i++) {

a[j] = i;if (ok(j + 1, a)) { nqueens_ser_task(n, j + 1, a,

&local_sols);}

}}

#pragma css mutex lock (results)*results = *results + local_sols;

#pragma css mutex unlock (results)}

N Queens

• Memory allocation: automatic renames the array

void nqueens(int n, int j, char *a, char *b, int depth) { int i;

for (i = 0; i < n; i++) {a[j] = i;if (ok(j + 1, a)) { add_queen_task(b, j, i, n); if (depth < task_depth) {

nqueens(n, j + 1, a, b, depth + 1); } else {

nqueens_ser_task(n, j + 1, b, &total_res);

}}

}}

#pragma css task input (j, i, n)\ inout (a[n]) highpriorityvoid add_queen_task(char *a, int j, int i, int n){

a[j] = i;}


Stream

• Stream is one of the HPC Challenge benchmark suite

http://icl.cs.utk.edu/hpcc/

• Stream is a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernel.

• Original OpenMP version: inserts barriers after each operation has processed all elements of the array

• Two StarSs versions: with and without barriers

• Without barriers the correctness is guaranteed thanks to the data-dependence analysis

Copy: c = a

Scale: b = s*c

Add: c = a + b

Triad: a = b + s*c

Initialize: a, b, c


Stream: mimicking original version

main (){

#pragma css start

tuned_initialization();

#pragma css barrier

scalar = 3.0;

for (k=0; k<NTIMES; k++)}

tuned_STREAM_Copy();

#pragma css barrier

tuned_STREAM_Scale(scalar);

#pragma css barrier

tuned_STREAM_Add();

#pragma css barrier

tuned_STREAM_Triad(scalar);

#pragma css barrier

}

#pragma css finish

}

#pragma css task input (a) output (c)void copy_task(double a[BSIZE], double c[BSIZE]){ int j; for (j=0; j < BSIZE; j++) c[j] = a[j];}void tuned_STREAM_Copy(){ int j; for (j=0; j<N; j+=BSIZE) copy_task (&a[j], &c[j]);}

#pragma css task input (c, scalar) output (b)void scale_task (double b[BSIZE], double c[BSIZE],

double scalar){ int j; for (j=0; j < BSIZE; j++) b[j] = scalar*c[j];}void tuned_STREAM_Scale(double scalar){ int j; for (j=0; j<N; j+=BSIZE) scale_task (&b[j], &c[j], scalar);}


Stream

#pragma css task input (a, b) output (c)void add_task (double a[BSIZE], double b[BSIZE], double c[BSIZE]){ int j; for (j=0; j < BSIZE; j++) c[j] = a[j]+b[j];}void tuned_STREAM_Add(){ int j; for (j=0; j<N; j+=BSIZE) add_task(&a[j], &b[j], &c[j]);}#pragma css task input (b, c, scalar) output (a)void triad_task (double a[BSIZE], double b[BSIZE], double c[BSIZE], double scalar){ int j; for (j=0; j < BSIZE; j++) a[j] = b[j]+scalar*c[j];}void tuned_STREAM_Triad(double scalar){ int j; for (j=0; j<N; j+=BSIZE) triad_task (&a[j], &b[j], &c[j], scalar);}


Stream: version without barriers

• Version without barriers: SMPSs/CellSs data dependence analysis guarantees correctness

main (){

#pragma css start

tuned_initialization();

#pragma css barrier

scalar = 3.0;

for (k=0; k<NTIMES; k++)}

tuned_STREAM_Copy();

tuned_STREAM_Scale(scalar);

tuned_STREAM_Add();

tuned_STREAM_Triad(scalar);

}

#pragma css finish

}

5 6 7 8

9 10 1211

Copy

Scale

Add

Triad

17 18 19 20

Init

1 2 3 4

13 14 15 16

same iteration

next iteration


Molecular dynamics: Argon

• Molecular dynamics: Argon simulation

• Simulates the mobility of Argon atoms in gas state, in a constant volume at T=300K

• All elestrostatic forces observed for each of the atoms against the others are considered (F

i)

• The second Newton law is then applied to each atom

Fi=m*a

i

• The initial velocities are random but reasonable for argon atoms at 300K

• To maintain a constant temperature in all the process the Berendsen algorithm is applied


Molecular dynamics: Argon simulation

program argon...!$CSS START do step=1,niter do ii=1, N, BSIZE do jj=1, N, BSIZE call velocity(BSIZE, ii, jj, x(ii), y(ii),

z(ii), x(jj), y(jj), z(jj), vx(ii),vy(ii), vz(ii))

enddo enddo do jj=1, N, BSIZE call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj)) enddo!$CSS BARRIER

tins=0.e0 do i=1,N tins=mkg*v(i)**2/3.e0/kb+tins enddo tins=tins/N lam1=sqrt(t/tins) do ii=1, N, BSIZE call update_position(BSIZE, lam1, vx(ii), vy(ii),

vz(ii), x(ii), y(ii), z(ii)) enddo enddo!$CSS FINISHend

program argon... interface !$CSS TASK subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj,

vx, vy, vz) implicit none integer, intent(in) :: BSIZE, ii, jj real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj real, intent(inout), dimension(BSIZE) :: vx, vy, vz end subroutine !$CSS TASK subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z) implicit none integer, intent(in) :: BSIZE real, intent(in) :: lam1 real, intent(inout), dimension(BSIZE) :: vx, vy, vz real, intent(inout), dimension(BSIZE) :: x, y, z end subroutine !$CSS TASK subroutine v_mod(BSIZE, v, vx, vy, vz) implicit none integer, intent(in) :: BSIZE real, intent(out) :: v(BSIZE) real, intent(in), dimension(BSIZE) :: vx, vy, vz end subroutine end interface


Molecular dynamics: Argon simulation

program argon...!$CSS START do step=1,niter do ii=1, N, BSIZE do jj=1, N, BSIZE call velocity(BSIZE, ii, jj, x(ii), y(ii),

z(ii), x(jj), y(jj), z(jj), vx(ii),vy(ii), vz(ii))

enddo enddo do jj=1, N, BSIZE call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj)) enddo!$CSS BARRIER

tins=0.e0 do i=1,N tins=mkg*v(i)**2/3.e0/kb+tins enddo tins=tins/N lam1=sqrt(t/tins) do ii=1, N, BSIZE call update_position(BSIZE, lam1, vx(ii), vy(ii),

vz(ii), x(ii), y(ii), z(ii)) enddo enddo!$CSS FINISHend

!$CSS TASKsubroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)! subroutine code end subroutine!$CSS TASKsubroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)! subroutine code end subroutine!$CSS TASKsubroutine v_mod(BSIZE, v, vx, vy, vz)! subroutine code end subroutine


Hands-on

• Copy files from:

• /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial.tar.gz


Extensions to OpenMP


Extensions to OpenMP tasking

E. Ayguade, et all, “A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures” LNCS, IWOMP 2009, IJPP2010

Dependences: not all arguments in directionality clause

Different implementation

s

Heterogeneous devices

Separation dependences/transfers


Sparse LU (data by blocks)

Inline directives: saves manual outlining !!! Tasks have no name not multiple implementations


Heterogeneous Cholesky

Annotated function declaration: ALL instances become tasks


Array sections


Hybrid MPI/SMPSs


A few hybrid examples

Kernel Explanation Specific features

MxM Eurobench mod2am

Pinpong Sample ping pong example

Linpack HPC Challenge benchmark

FT NAS NPB FT benchmark


Mod2am: Original MPI code

for( i = 0; i < processes; i++ ) { stag = i + 1; rtag = stag; indx = (me + nodes - i )%processes; shift = ((offset[indx][0]/BS)*nDIM*BS*BS); size = vsize*l; if(i%2 == 0) { mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats ); } else { mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) ); MPI_Sendrecv (rbuf, size, MPI_DOUBLE, down, stag, a, size, MPI_DOUBLE, up, rtag, comm, &stats ); }}

C A B

rbuf

void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c ){

double alpha=1.0, beta=1.0;int i, j;char tr = 't';dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);

}


for( i = 0; i < processes; i++ ) { stag = i + 1; rtag = stag; indx = (me + nodes - i )%processes; shift = ((offset[indx][0]/BS)*nDIM*BS*BS); size = vsize*l; if(i%2 == 0) { mxm( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); callSendRecv (a, size, down, stag, rbuf, up, rtag); } else { mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) ); callSendRecv (rbuf, size, down, stag, a, up, rtag); } }

Mod2am: Overlap communication and computation with StarSs

#pragma css task input (size, down, stag, up, rtag, a[size]) output (rbuf[size])void callSendRecv (double *a, int size, int down, int stag, double *rbuf, int up, int rtag){ MPI_Status stats; MPI_Sendrecv( a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, MPI_COMM_WORLD, &stats );}

#pragma css task input( lda, m, l, n, a[m*l], b[l*n]) inout(c[m*n]) void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c ){


}

C A B

rbuf

Parall

elism

!!!!!

com

ing fr

om re

nam

ing


Eurobench mod2am in StarSs

for( i = 0; i < processes; i++ ) { stag = i + 1; rtag = stag; indx = (me + nodes - i )%processes; shift = ((offset[indx][0]/BS)*nDIM*BS*BS); size = vsize*l;

mxm (lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); callSendRecv (a, size, down, stag, a, up, rtag);}

#pragma css task input( lda, m, l, n, a[m*l], b[l*n]) inout(c[m*n]) void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c ){


}

#pragma css task input (size, down, stag, up, rtag, a[size]) output (rbuf[size])void callSendRecv (double *a, int size, int down, int stag, double *rbuf, int up, int rtag){ MPI_Status stats; MPI_Sendrecv( a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, MPI_COMM_WORLD, &stats );}

C A B


Communication thread for MPI calls

CPU

CPU CPU

CPU

memory

NODE

communicationSMPSs tasks

computation SMPSs tasks

Ready-to-run queues

computation SMPSs thread

communicationSMPSs thread

• 1 MPI process per node(N cores)

• N computation threads per node

• 1 communication thread per node

• Communication and computation threads share CPU time

• Communication threads should wake up as soon as possible and do not waste CPU time while it is blocked

• Time sharing optimization by setting OS thread priorities and MPI waiting mode


OS thread priorities & MPI waiting mode

100% 100% use of CPU

100%

MPI calls uses blocking mode and communication thread does not waste resources while it is blocked

computation SMPSs thread

0% 0% use of CPU 0%

communicationSMPSs thread

CPU time

MPI request (~ms)100% use of CPU

End of communication (~us)100% use of CPU

Reducing the priority of the computation threads accelerates execution of the communication thread


Hierarchy: Hybrid MPI/SMPSs

• Linpack example

• Overlap communication/computation

• Extend asynchronous data-flow execution to outer level

• Automatic lookahead

…

for (k=0; k<N; k++) {

if (mine) {

Factor_panel(A[k]);

send (A[k])

} else {

receive (A[k]);

if (necessary) resend (A[k]);

}

for (j=k+1; j<N; j++)

update (A[k], A[j]);

…

#pragma css task inout(A[SIZE])

void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])void update(float *A, float *B);

#pragma css task input(A[SIZE]) target(comm_thread)

void send(float *A);

#pragma css task output(A[SIZE]) target (comm_thread)

void receive(float *A);

#pragma css task input(A[SIZE]) target

(comm_thread)void resend(float *A);

P0 P1 P2


MPI/SMPSs LINPACK

#pragma css task input(buf)output(req)target(comm_thread)void send (<type>buf[N*nb],MPI_Request *req)

#pragma css task output(buf, req) target(comm_thread)void recv (<type>buf[count],MPI_Request *req)

#pragma css task input(A) output(panel)highpriorityvoid fact (double A[N/P][NB], double tmp_panel[N/P][NB]); #pragma css task input(panel) inout(A)void update (double tmp_panel[N/P][NB], double A[N/P][NB]); #define NPANELS (N/NB*Q)#define root (j%Q==my_rank) double A[N/P][NPANELS*NB];double tmp_panel[N/P][NB]; int k=0; for( j = 0; j < N; j += NB ){ if (root){ fact (&A[j][j],tmp_panel);

send (buf, req_send); k++; }else{ recv (buf, req_recv); if (necessary) send (buf, req_forward);

}

for(i = k; i< NPANELS; i++ ) update (tmp_panel, &A[j][i*NB],k);}

Factorization phase contains fine grain communications and it is performed by a single task for computation threads; MPI waiting mode

is “polling”

• Keeps the same structure of the code like pure MPI version without look-ahead technique

• Naturally follows critical path in data-flow way

• Automatically overlaps communication and computation


MPI/SMPSs LINPACK

#pragma css task input(buf)output(req)target(comm_thread)void send (<type>buf[N*nb],MPI_Request *req)

#pragma css task output(buf, req) target(comm_thread)void recv (<type>buf[count],MPI_Request *req)

#pragma css task input(A) output(panel)highpriorityvoid fact (double A[N/P][NB], double tmp_panel[N/P][NB]); #pragma css task input(panel) inout(A)void update (double tmp_panel[N/P][NB], double A[N/P][NB]); #define NPANELS (N/NB*Q)#define root (j%Q==my_rank) double A[N/P][NPANELS*NB];double tmp_panel[N/P][NB]; int k=0; for( j = 0; j < N; j += NB ){ if (root){ fact (&A[j][j],tmp_panel);

send (buf, req_send); k++; }else{ recv (buf, req_recv); if (necessary) send (buf, req_forward);

}

for(i = k; i< NPANELS; i++ ) update (tmp_panel, &A[j][i*NB],k);}

1 MPI process 4 computation threads

1 communication thread

Factorization phase uses 99%cpu time

(polling mode)

Broadcast phase uses < 5% cpu time

(blocking mode)


MPI + SMPSs: the linpack example


do iter = 1, niter

call evolve(u0, u1, twiddle, dims(1,1), dims(2,1), dims(3,1))

call fft(-1, u1, u2)

call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1))

end do

NAS FT Code

61

subroutine fft(dir, x1, x2)...if (dir .eq. 1) then ...else

...else if (layout_type .eq. layout_1d) then call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), x1, x_help)

l1=3 call transpose2_local (dims(1,l1),dims(2, l1)*dims(3, l1),

x_help, x2)call transpose2_global (dims(1,l1)*dims(2, l1),dims(3, l1), x2,

x_help)call transpose2_finish (dims(1,l1), dims(2, l1)*dims(3, l1),

x_help, x2)call cffts2 (-1, dims(1,2), dims(2,2), dims(3,2), x2, x_help)call cffts1 (-1, dims(1,1), dims(2,1), dims(3,1), x_help, x2)

else if ... ... return end

x_help

x_help

inputinout

output


!$css task

subroutine evolve(u0, u1, twiddle, d1, d2, d3)

implicit none

integer, intent(in) :: d1, d2, d3

double complex, intent(in) :: twiddle(d1*d2*d3)

double complex, intent(inout) :: u0(d1*d2*d3)

double complex, intent(out) :: u1(d1*d2*d3)

end subroutine

Tasks

62

!$css tasksubroutine checksum(i, u1, d1, d2, d3)implicit noneinteger, intent(in) :: i, d1, d2, d3double complex, intent(in) :: u1(d1*d2*d3)end subroutine

!$css task HIGHPRIORITY()

subroutine cffts1(is, d1, d2, d3, x, xout)

implicit none

integer, intent(in) :: is, d1, d2, d3

double complex, intent(in) :: x(d1*d2*d3)

double complex, intent(out) :: xout(d1*d2*d3)

end subroutine


Tasks

63

!$css task HIGHPRIORITY() subroutine cffts2(is, d1, d2, d3, x, xout)implicit noneinteger, intent(in) :: is, d1, d2, d3double complex, intent(in) :: x(d1*d2*d3)double complex, intent(out) ::

xout(d1*d2*d3)end subroutine


subroutine transpose2_local(n1, n2, xin, xout)

implicit none

integer, intent(in) :: n1, n2

double complex, intent(inout) :: xin(n1*n2)

double complex, intent(inout) :: xout(n1*n2)

end subroutine

!$css task TARGET(COMM_THREAD)subroutine transpose2_global(n1, n2, xin, xout)implicit noneinteger, intent(in) :: n1, n2double complex, intent(inout) :: xin(n1*n2)double complex, intent(inout) :: xout(n1*n2)end subroutine


subroutine transpose2_finish(n1, n2, xin, xout)

implicit none

integer, intent(in) :: n1, n2

double complex, intent(inout) :: xin(n1*n2)

double complex, intent(inout) :: xout(n1*n2)

end subroutine


Task graph

64

evolveevolve

cffts1cffts1

cffts2cffts2

transp_localtransp_local

transp_globaltransp_global

cffts1cffts1

check sumcheck sum

...

transp_finishtransp_finish

MPI_all_to_all

evolveevolve

cffts1cffts1

cffts2cffts2



cffts1cffts1

check sumcheck sum


evolveevolve

cffts1cffts1

cffts2cffts2



cffts1cffts1

check sumcheck sum


P0 P1 PN-1

Loop (one iteration per process)


Task graph

65

evolveevolve

cffts1cffts1

cffts2cffts2

transp_local

transp_local

transp_global

transp_global

cffts1cffts1

check sum

check sum

transp_finish

transp_finish

MPI_all_to_all

P0

evolveevolve

cffts1cffts1

cffts2cffts2

transp_local

transp_local

transp_global

transp_global

cffts1cffts1

check sum

check sum

transp_finish

transp_finish

evolveevolve

cffts1cffts1

cffts2cffts2

transp_local

transp_local

transp_global

transp_global

cffts1cffts1

check sum

check sum

transp_finish

transp_finish

MPI_all_to_all

P0

evolveevolve

cffts1cffts1

cffts2cffts2

transp_local

transp_local

transp_global

transp_global

cffts1cffts1

check sum

check sum

transp_finish

transp_finish

evolveevolve

cffts1cffts1

transp_local

transp_local

Several iterations


Performance analysis with Paraver

• Paraver traces are generated both for pure SMPSs and for MPI/SMPSs

• For pure SMPSs:

• Simply compile with “-t” option

• Run as usual, the paraver file is generated

• For MPI/SMPSs applications

• Compile with “-t” option

• Set the LD_PRELOAD variable

export LD_PRELOAD=$MPItrace_PATH/libsmpssmpitrace.so

• More details in the SMPSs manual


Performance analysis

• SMPSs comes with a set of Paraver configuration files

• $(SMPSs)/share/cellss/paraver_cfgs/

• Also, Paraver distribution comes with a set of Paraver configuration files, for example for specific MPI analysis

• Some available configuration files:

• task.cfg: view of the tasks, different color means different task type

• execution_phases.cfg: shows the activity of threads, i.e., percentatge of time in each different phase (adding task, executing a task...)

• task_number.cfg: shows id number of the tasks being executed

• flushing.cfg: shows period when application is writing the tracefile, i.e, in this period the tracefile is not significant

• Task_profile.cfg: Classiffication of tasks versus threads

• 3dh_duration_tasks.cfg: tasks duration histogran, van be viewed by task type


Performance analysis


Conclusion

• StarSs is an active research project

• With a global view on multicore and parallel programming

• Stable enough to start porting relevant production codes at BSC

• With a lot of potential still to be investigated.

Documents

Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC