
Page 1: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Programming with StarSs

Rosa M. Badia, Computer Sciences Research Dept.

BSC

Page 2: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 2

Agenda

• StarSs overview
• SMPSs
• SMPSs examples
• Single node hands-on
• Hybrid model MPI/SMPSs
• Programming examples
• MPI/SMPSs hands-on

• Slides available in:
  • http://marsa.ac.upc.edu/prace/course_material_starss
  • /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial_PRACE_starss.ppt

Page 3: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 3

StarSs overview

Page 4: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 4

StarSs

• StarSs

– A “node” level programming model
– C/Fortran + directives
– Nicely integrates in hybrid MPI/StarSs
– Natural support for heterogeneity

• Programmability
  – Incremental parallelization/restructure
  – Abstract/separate algorithmic issues from resources
  – Disciplined programming

• Portability
  – “Same” source code runs on “any” machine
    • Optimized task implementations will result in better performance
  – “Single source” for the maintained version of an application

• Performance
  – Asynchronous (data-flow) execution and locality awareness
  – Intelligent runtime: specific for each type of target platform
    • Automatically extracts and exploits parallelism
    • Matches computations to resources

StarSs instantiations: CellSs, SMPSs, GPUSs, GridSs, ClearSpeedSs, ClusterSs, CompSs (Java)

Page 5: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 5

History / Strategy

PERMAS ~1994 → NANOS ~1996 → GridSs ~2002 → CellSs ~2006 → SMPSs V1 ~2007 → CompSs ~2007 → NANOS++ ~2008 → GPUSs ~2009 → SMPSs V2 ~2009

Page 6: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 6

StarSs: a sequential program …

void vadd3 (float A[BS], float B[BS], float C[BS]);
void scale_add (float sum, float A[BS], float B[BS]);
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)        // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)        // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)        // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)        // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)        // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

Page 7: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 7

StarSs: … taskified …

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);

#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);

#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)        // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)        // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)        // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)        // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)        // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

[Task dependence graph: colors/numbers give the order of task instantiation; some antidependences covered by flow dependences are not drawn.]

Compute dependences @ task instantiation time

Page 8: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 8

StarSs: … and executed in a data-flow model

Decouple how we write (Write) from how it is executed (Execute)

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);

#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);

#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)        // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)        // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)        // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)        // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)        // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

[Task graph: colors/numbers give a possible order of task execution.]

Page 9: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 9

StarSs: the potential of data access information

• Flat global address space seen by programmer

• Flexibility to dynamically traverse dataflow graph “optimizing”

• Concurrency. Critical path

• Memory access: data transfers performed by run time

• Opportunities for

• Prefetch

• Reuse

• Eliminate antidependences (rename)

• Replication management

• Coherency/consistency handled by the runtime
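A minimal illustrative sketch of the renaming point (not taken from the slides; it reuses the #pragma css task syntax of the previous slides). B is reused as a scratch buffer in every iteration, which creates write-after-read antidependences between iterations; since the runtime knows B is only written by produce and read by consume, it can rename B (give each iteration a private instance) and overlap the iterations:

#pragma css task input(A[BS]) output(B[BS])
void produce (float *A, float *B);

#pragma css task input(B[BS]) output(C[BS])
void consume (float *B, float *C);

float A[N], B[BS], C[N];
int i;

for (i = 0; i < N; i += BS) {
   produce (&A[i], B);    // writes B
   consume (B, &C[i]);    // reads B; without renaming the next produce would have to wait
}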

Page 10: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 10

StarSs: … reductions

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);

#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);

#pragma css task input(A) inout(sum) reduction(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)        // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)        // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)        // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)        // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)        // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

[Task graph: colors/numbers give a possible order of task execution.]

Page 11: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 11

StarSs: just a few directives

#pragma css task [input ( parameters ) ] \

[output ( parameters ) ] \

[inout ( parameters )] \

[target device( [cell, smp, cuda] ) ] \

[implements ( task_name ) ] \

[reduction ( parameters ) ] \

[ highpriority ]

#pragma css wait on ( data_address )

#pragma css barrier

#pragma css mutex lock ( variable )

#pragma css mutex unlock( variable )

parameters: parameter [ , parameter ]*
parameter: variable_name {[dimension]}*

Page 12: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 12

StarSs Syntax

• Examples: task selection

#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[size][size], B[size][size], size) inout(C[size][size])
void block_addmultiply( float *C, float *A, float *B, int size ) { ...

Note: task annotations can be before function code or before function declaration, in a header file.

• Examples: waiting for data

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ... }

...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (sq_error)
if (sq_error > 0.0000001) { ...

Page 13: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 13

StarSs Syntax

• Examples: reductions

#pragma css task input (vec, n) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results)
{
   int i;
   int local_sol = 0;

   for (i = 0; i < n; i++)
      local_sol += vec[i];

#pragma css mutex lock (results)
   *results = *results + local_sol;
#pragma css mutex unlock (results)
}

• Examples: reductions. The parameter used in the mutex can be anything that identifies the access uniquely

#pragma css task input (vec, n, id) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results, int id)
{
   int i;
   int local_sol = 0;

   for (i = 0; i < n; i++)
      local_sol += vec[i];

#pragma css mutex lock (id)
   *results = *results + local_sol;
#pragma css mutex unlock (id)
}

Page 14: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 14

StarSs Syntax

• MPI tasks

#pragma css task input(partner, bufsend) output(bufrecv) target(comm_thread)
void communication(int partner, double bufsend[BLOCK_SIZE], double bufrecv[BLOCK_SIZE])
{
   int ierr;
   MPI_Request request[2];
   MPI_Status status[2];

   ierr = MPI_Isend(bufsend, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[0]);
   ierr = MPI_Irecv(bufrecv, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[1]);

   MPI_Waitall(2, request, status);
}

Page 15: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 15

StarSs Syntax in Fortran

subroutine example()
  ...
  interface
    !$CSS TASK
    subroutine block_add_multiply(C, A, B, BS)
      implicit none
      integer, intent (in) :: BS
      real, intent (in) :: A(BS,BS), B(BS,BS)
      real, intent (inout) :: C(BS,BS)
    end subroutine
  end interface
  ...
  !$CSS START
  ...
  call block_add_multiply(C, A, B, BLOCK_SIZE)
  ...
  !$CSS FINISH
  ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
  ...
end subroutine

Page 16: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 16

StarSs Syntax in Fortran: with MPI

subroutine example()
  ...
  interface
    !$css task TARGET(COMM_THREAD)
    subroutine checksum(i, u1, d1, d2, d3)
      implicit none
      integer, intent(in) :: i, d1, d2, d3
      double complex, intent(in) :: u1(d1*d2*d3)
    end subroutine
  end interface
  ...

Page 17: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 17

SMPSs

Page 18: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 18

SMPSs implementation

[Runtime structure figure] The main thread runs the user main program and the SMPSs runtime library: it decodes and renames task arguments (renaming table), computes data dependences, and inserts ready tasks into the global ready-task queues (high and low priority). Each worker (slave) thread, one per CPU, has its own ready-task queue, performs scheduling and task execution of the original task code, and resorts to work stealing when its queue is empty. All threads share the node memory.

J.M. Perez, et al, “A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures” Cluster 2008

Page 19: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 19

SMPSs: Compiler phase

[Compilation flow] app.c goes through the SMPSs code translation phase (mcc), which produces smpss-cc_app.c and a task list (app.tasks); the native C compiler (gcc, icc, ...) compiles it into smpss-cc_app.o, and the SMPSS-CC driver packs it together with app.o.

Page 20: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 20

SMPSs: Linker phase

[Link flow] The packed object (smpss-cc_app.o) is unpacked into app.o and smpss-cc-app.c; from app.tasks the glue code generator produces adapter code (exec-adapters.c, app-adapters.c) and registration code (exec-registration.c); the C compiler (gcc, icc, ...) compiles these (exec-adapters.o, exec-registration.o) and the linker combines everything with libSMPSS.so into the executable (exec).

Page 21: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 21

SMPSs Programming examples

Page 22: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 22

Programming examples

Kernel          | Explanation              | Specific features
Simple example  | initial simple codes     | Basic StarSs features
MxM             | matrix multiply          | Basic features; CellSs, GPUSs implementations
Cholesky        | Cholesky factorization   | Irregular graph
N Queens        | solution of N Queens     | Recursion, memory management, granularity
Stream          | memory bandwidth stress  | Use of barrier
Argon           | simple MD simulation     | Example in Fortran

Page 23: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 23

Simple examples

#pragma css task input(A, B) inout(C)
void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ... }

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void dgemm( float *C, float *A, float *B ) { ... }

#pragma css task input(A, B) inout(C)
void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ... }

main() {
   ...
#pragma css barrier
   t = time();
   dgemm(A,B,C);
   dgemm(D,E,F);
   dgemm(C,F,G);
   dgemm(C,E,H);
#pragma css barrier
   t = time() - t;
   printf ("result G = %f; time = %d\n", G[0][0], t);
}

Notes: the task pragma is where the information on parameter sizes and addresses comes from; the barriers are inserted for timing purposes.

Page 24: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 24

Simple examples

#pragma css task input(A, B) inout(C)
void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ... }

main() {
   ...
   dgemm(A,B,C);
   dgemm(D,E,F);
   dgemm(C,F,G);
   dgemm(A,D,H);
   dgemm(C,H,I);

#pragma css wait on (F)
   printf ("result F = %f\n", F[0][0]);

   dgemm(H,G,C);

#pragma css barrier
   printf ("result C = %f\n", C[0][0]);
}

Page 25: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 25

MxM on matrix stored by blocks

int main (int argc, char **argv) {
   int i, j, k;
   ...
   initialize(A, B, C);

   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            mm_tile( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
   int i, j, k;

   for (i=0; i < BS; i++)
      for (j=0; j < BS; j++)
         for (k=0; k < BS; k++)
            C[i][j] += A[i][k] * B[k][j];
}

Will work on matrices of any size

Will work on any number of cores/devices


Page 26: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 26

MxM @ CellSs


int main (int argc, char **argv) {
   int i, j, k;
   ...
   initialize(A, B, C);

   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            dgem_64x64( C[i][j], A[i][k], B[k][j]);
}

#define StageCBA(OFFSET,OFFB) \
{ \
   ALIGN8B; \
   SPU_FMA(c0_0B,a00,b0_0B,c0_0B); c0_0C = *((volatile vector float *)(ptrC+OFFSET+16)); \
   SPU_FMA(c1_0B,a10,b0_0B,c1_0B); c1_0C = *((volatile vector float *)(ptrC+M+OFFSET+16)); \
   SPU_FMA(c2_0B,a20,b0_0B,c2_0B); c2_0C = *((volatile vector float *)(ptrC+2*M+OFFSET+16)); \
   SPU_FMA(c3_0B,a30,b0_0B,c3_0B); SPU_LNOP; \
   SPU_FMA(c0_1B,a00,b0_1B,c0_1B); c3_0C = *((volatile vector float *)(ptrC+3*M+OFFSET+16)); \
   SPU_FMA(c1_1B,a10,b0_1B,c1_1B); c0_1C = *((volatile vector float *)(ptrC+OFFSET+20)); \
   SPU_FMA(c2_1B,a20,b0_1B,c2_1B); c1_1C = *((volatile vector float *)(ptrC+M+OFFSET+20)); \
   SPU_FMA(c3_1B,a30,b0_1B,c3_1B); SPU_LNOP; \
   SPU_FMA(c0_0B,a01,b1_0B,c0_0B); c2_1C = *((volatile vector float *)(ptrC+2*M+OFFSET+20)); \
   .......

static void dgem_64x64(volatile float *blkC, volatile float *blkA, volatile float *blkB)
{
   unsigned int i;
   volatile float *ptrA, *ptrB, *ptrC;
   vector float a0, a1, a2, a3;
   vector float a00, a01, a02, a03;
   vector float a10, a11, a12, a13;
   .....
   for(i=0; i<M; i+=4){
      ptrA = &blkA[i*M]; ptrB = &blkB[0]; ptrC = &blkC[i*M];
      a0 = *((volatile vector float *)(ptrA));
      a1 = *((volatile vector float *)(ptrA+M));
      a2 = *((volatile vector float *)(ptrA+2*M));
      a3 = *((volatile vector float *)(ptrA+3*M));
      a00 = spu_shuffle(a0, a0, pat0); a01 = spu_shuffle(a0, a0, pat1);
      a02 = spu_shuffle(a0, a0, pat2); a03 = spu_shuffle(a0, a0, pat3);
      a10 = spu_shuffle(a1, a1, pat0); a11 = spu_shuffle(a1, a1, pat1);
      ……
      a33 = spu_shuffle(a3, a3, pat3);
      Loads4RegSetA(0); Ops4RegSetA(); Loads4RegSetB(8);
      StageCBA(0,0); StageACB(8,0); StageBAC(16,0); StageCBA(24,0); StageACB(32,0); StageBAC(40,0); StageMISC(0,0);
      StageCBAmod(0,4); StageACB(8,4); StageBAC(16,4); StageCBA(24,4); StageACB(32,4); StageBAC(40,4); StageMISC(4,4);
      StageCBAmod(0,8); StageACB(8,8);
      .....

#ifdef SPU_CODE
#include <blas_s.h>
#endif

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void dgemm_64x64(double *C, double *A, double *B);

Leverage existing kernels, libraries,…

Page 27: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 27

MxM @ GPUSs using CUBLAS kernel

int main (int argc, char **argv) {
   int i, j, k;
   ...
   initialize(A, B, C);

   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            mm_tile( C[i][j], A[i][k], B[k][j], BS);
}

#pragma css task input(A[NB][NB], B[NB][NB], NB) inout(C[NB][NB]) target device(cuda)
void mm_tile (float *A, float *B, float *C, int NB)
{
   unsigned char TR = 'T', NT = 'N';
   float DONE = 1.0, DMONE = -1.0;
   float *d_A, *d_B, *d_C;

   cublasSgemm (NT, NT, NB, NB, NB, DMONE, A, NB, B, NB, DONE, C, NB);
}


Page 28: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 28

Cholesky factorization

• Cholesky factorization

• Common matrix operation used to solve normal equations in linear least squares problems.

  • Calculates a triangular matrix (L) from a symmetric and positive definite matrix A.

Cholesky(A) = L

L · L^t = A

• Different possible implementations, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)

• It can be decomposed in block operations

Page 29: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 29

Cholesky factorization

• In each iteration red and blue blocks are updated

• SPOTRF: Computes the Cholesky factorization of the diagonal block .

• STRSM: Computes the column panel

• SSYRK: Computes the row panel

• SGEMM: Updates the rest of the matrix
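Written as block updates (added here for clarity, using the A[i,j] notation of the code comments on the next slide; left-looking formulation), the four kernels perform:

SGEMM:  A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t        for k < j and i > j
SSYRK:  A[j,j] = A[j,j] - A[j,k] * (A[j,k])^t        for k < j
SPOTRF: factorize the diagonal block, A[j,j] = L[j,j] * (L[j,j])^t
STRSM:  solve X from X * (A[j,j])^t = A[i,j] and store it in A[i,j], for i > j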


Page 30: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 30

Cholesky factorization

main (){
   ...
   for (int j = 0; j < DIM; j++){
      for (int k = 0; k < j; k++){
         for (int i = j+1; i < DIM; i++){
            // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
            css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
         }
      }
      for (int i = 0; i < j; i++){
         // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
         css_ssyrk_tile(A[j][i], A[j][j]);
      }
      // Cholesky factorization of A[j,j]
      css_spotrf_tile( A[j][j] );
      for (int i = j+1; i < DIM; i++){
         // A[i,j] <- A[i,j] = X * (A[j,j])^t
         css_strsm_tile( A[j][j], A[i][j] );
      }
   }
   ...
   for (int i = 0; i < DIM; i++) {
      for (int j = 0; j < DIM; j++) {
#pragma css wait on (A[i][j])
         print_block(A[i][j]);
      }
   }
   ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)


Page 31: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 31

Cholesky factorization

Page 32: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 32

N Queens

  • Compute the number of solutions to the problem of placing N queens on an N×N board, without any of them attacking any other

N          4   5   6   7   8   9   10   ...  14
Solutions  2   10  4   40  92  352 724  ...  365596

Example solution for N=8: 2 4 6 8 3 1 7 5

Page 33: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 33

void nqueens(int n, int j, char *a, int *results)
{
   int i;

   if (n == j) {
      /* put good solution in heap, return pointer to it. */
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
      (*results)++;
   } else {
      /* try each possible position for queen <j> */
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens(n, j + 1, a, results);
         }
      }
   }
}

...
nqueens (n, 0, a, &total_res);
...

N Queens: sequential version

int ok(int n, char *a)
{
   int i, j;
   char p, q;

   for (i = 0; i < n; i++) {
      p = a[i];
      for (j = i + 1; j < n; j++) {
         q = a[j];
         if (q == p || q == p - (j - i) || q == p + (j - i))
            return 0;
      }
   }
   return 1;
}

Page 34: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 34

Queens strategy in SMPSs

[Search-tree figure: the tree is explored down to a CUT level; each subtree below the CUT level becomes a task and is executed sequentially inside that task, and the leaves are solutions.]

Page 35: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 35

#pragma css task input (n, j, a[n]) inout (results) reduction (results)
void nqueens_ser_task(int n, int j, char *a, int *results)
{
   int i;
   int local_sols = 0;

   if (n == j) {
#pragma css mutex lock (&find_solution)
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
#pragma css mutex unlock (&find_solution)
      local_sols++;
   } else {
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens_ser_task(n, j + 1, a, &local_sols);
         }
      }
   }

#pragma css mutex lock (results)
   *results = *results + local_sols;
#pragma css mutex unlock (results)
}

N Queens

• Manual memory allocation

void nqueens(int n, int j, char *a, int depth)
{
   int i;
   char *b;

   for (i = 0; i < n; i++) {
      a[j] = i;
      if (ok(j + 1, a)) {
         if (depth < task_depth) {
            nqueens(n, j + 1, a, depth + 1);
         } else {
            b = malloc (BOARD_SIZE * sizeof(char));
            memcpy (b, a, (j + 1) * sizeof(char));
            nqueens_ser_task(n, j + 1, b, &total_res);
         }
      }
   }
}

Page 36: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 36

#pragma css task input (n, j, a[n]) inout (results) reduction (results)
void nqueens_ser_task(int n, int j, char *a, int *results)
{
   int i;
   int local_sols = 0;

   if (n == j) {
#pragma css mutex lock (&find_solution)
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
#pragma css mutex unlock (&find_solution)
      local_sols++;
   } else {
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens_ser_task(n, j + 1, a, &local_sols);
         }
      }
   }

#pragma css mutex lock (results)
   *results = *results + local_sols;
#pragma css mutex unlock (results)
}

N Queens

• Memory allocation: automatic (the runtime renames the array)

void nqueens(int n, int j, char *a, char *b, int depth)
{
   int i;

   for (i = 0; i < n; i++) {
      a[j] = i;
      if (ok(j + 1, a)) {
         add_queen_task(b, j, i, n);
         if (depth < task_depth) {
            nqueens(n, j + 1, a, b, depth + 1);
         } else {
            nqueens_ser_task(n, j + 1, b, &total_res);
         }
      }
   }
}

#pragma css task input (j, i, n) inout (a[n]) highpriority
void add_queen_task(char *a, int j, int i, int n)
{
   a[j] = i;
}

Page 37: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 37

Stream

  • Stream is part of the HPC Challenge benchmark suite

http://icl.cs.utk.edu/hpcc/

  • Stream is a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels.

• Original OpenMP version: inserts barriers after each operation has processed all elements of the array

• Two StarSs versions: with and without barriers

• Without barriers the correctness is guaranteed thanks to the data-dependence analysis

Copy: c = a

Scale: b = s*c

Add: c = a + b

Triad: a = b + s*c

Initialize: a, b, c

Page 38: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 38

Stream: mimicking original version

main (){

#pragma css start

tuned_initialization();

#pragma css barrier

scalar = 3.0;

for (k=0; k<NTIMES; k++) {

tuned_STREAM_Copy();

#pragma css barrier

tuned_STREAM_Scale(scalar);

#pragma css barrier

tuned_STREAM_Add();

#pragma css barrier

tuned_STREAM_Triad(scalar);

#pragma css barrier

}

#pragma css finish

}

#pragma css task input (a) output (c)
void copy_task(double a[BSIZE], double c[BSIZE])
{
   int j;
   for (j=0; j < BSIZE; j++)
      c[j] = a[j];
}

void tuned_STREAM_Copy()
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      copy_task (&a[j], &c[j]);
}

#pragma css task input (c, scalar) output (b)
void scale_task (double b[BSIZE], double c[BSIZE], double scalar)
{
   int j;
   for (j=0; j < BSIZE; j++)
      b[j] = scalar*c[j];
}

void tuned_STREAM_Scale(double scalar)
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      scale_task (&b[j], &c[j], scalar);
}

Page 39: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 39

Stream

#pragma css task input (a, b) output (c)
void add_task (double a[BSIZE], double b[BSIZE], double c[BSIZE])
{
   int j;
   for (j=0; j < BSIZE; j++)
      c[j] = a[j]+b[j];
}

void tuned_STREAM_Add()
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      add_task(&a[j], &b[j], &c[j]);
}

#pragma css task input (b, c, scalar) output (a)
void triad_task (double a[BSIZE], double b[BSIZE], double c[BSIZE], double scalar)
{
   int j;
   for (j=0; j < BSIZE; j++)
      a[j] = b[j]+scalar*c[j];
}

void tuned_STREAM_Triad(double scalar)
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      triad_task (&a[j], &b[j], &c[j], scalar);
}

Page 40: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 40

Stream: version without barriers

• Version without barriers: SMPSs/CellSs data dependence analysis guarantees correctness

main (){

#pragma css start

tuned_initialization();

#pragma css barrier

scalar = 3.0;

for (k=0; k<NTIMES; k++) {

tuned_STREAM_Copy();

tuned_STREAM_Scale(scalar);

tuned_STREAM_Add();

tuned_STREAM_Triad(scalar);

}

#pragma css finish

}

[Task graph: Init tasks (1-4) feed Copy (5-8), Scale (9-12), Add (13-16) and Triad (17-20); without barriers, tasks of the next iteration can overlap with tasks of the same iteration, driven only by data dependences.]

Page 41: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 41

Molecular dynamics: Argon

• Molecular dynamics: Argon simulation

• Simulates the mobility of Argon atoms in gas state, in a constant volume at T=300K

  • All electrostatic forces exerted on each atom by all the others are considered (F_i)

  • Newton's second law is then applied to each atom: F_i = m*a_i

• The initial velocities are random but reasonable for argon atoms at 300K

• To maintain a constant temperature in all the process the Berendsen algorithm is applied

Page 42: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 42

Molecular dynamics: Argon simulation

program argon
...
!$CSS START
  do step=1,niter
    do ii=1, N, BSIZE
      do jj=1, N, BSIZE
        call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
      enddo
    enddo
    do jj=1, N, BSIZE
      call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
    enddo
!$CSS BARRIER
    tins=0.e0
    do i=1,N
      tins=mkg*v(i)**2/3.e0/kb+tins
    enddo
    tins=tins/N
    lam1=sqrt(t/tins)
    do ii=1, N, BSIZE
      call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
    enddo
  enddo
!$CSS FINISH
end

program argon
...
  interface
    !$CSS TASK
    subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE, ii, jj
      real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
    end subroutine
    !$CSS TASK
    subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(in) :: lam1
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
      real, intent(inout), dimension(BSIZE) :: x, y, z
    end subroutine
    !$CSS TASK
    subroutine v_mod(BSIZE, v, vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(out) :: v(BSIZE)
      real, intent(in), dimension(BSIZE) :: vx, vy, vz
    end subroutine
  end interface

Page 43: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 43

Molecular dynamics: Argon simulation

program argon
...
!$CSS START
  do step=1,niter
    do ii=1, N, BSIZE
      do jj=1, N, BSIZE
        call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
      enddo
    enddo
    do jj=1, N, BSIZE
      call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
    enddo
!$CSS BARRIER
    tins=0.e0
    do i=1,N
      tins=mkg*v(i)**2/3.e0/kb+tins
    enddo
    tins=tins/N
    lam1=sqrt(t/tins)
    do ii=1, N, BSIZE
      call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
    enddo
  enddo
!$CSS FINISH
end

!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
! subroutine code
end subroutine

Page 44: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 44

Hands-on

• Copy files from:

• /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial.tar.gz

Page 45: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 45

Extensions to OpenMP

Page 46: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 46

Extensions to OpenMP tasking

E. Ayguade, et al., “A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures”, LNCS, IWOMP 2009, IJPP 2010

Dependences: not all arguments need to appear in a directionality clause

Different implementations

Heterogeneous devices

Separation dependences/transfers

Page 47: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 47

Sparse LU (data by blocks)

Inline directives save the manual outlining of tasks. Tasks have no name, so multiple implementations are not possible (see the sketch below).
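A hedged sketch of what an inline task directive looks like (illustrative only, not the exact published syntax; the slide's original code figure is not reproduced here, and bmod is a hypothetical sparse LU block kernel):

// the annotated statement itself becomes the task, so no outlined task function is needed
for (int i = k+1; i < NB; i++)
   for (int j = k+1; j < NB; j++)
      if (A[i][j] != NULL) {
#pragma omp task input(A[i][k], A[k][j]) inout(A[i][j])
         bmod (A[i][k], A[k][j], A[i][j]);   // hypothetical block update kernel
      }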

Page 48: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 48

Heterogeneous Cholesky

Annotated function declaration: ALL instances become tasks
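A hedged sketch of the idea (the slide's code figure is not reproduced; kernel names follow the earlier Cholesky slides and the clause spelling is illustrative):

// annotating the declaration (e.g. in a header) turns every call site into a task;
// target device states where an implementation can run
#pragma omp task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB]) target device(cuda)
void sgemm_tile (float *A, float *B, float *C);

#pragma omp task inout(A[NB][NB]) target device(smp)
void spotrf_tile (float *A);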

Page 49: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 49

Array sections

Page 50: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 50

Hybrid MPI/SMPSs

Page 51: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 51

A few hybrid examples

Kernel    | Explanation
MxM       | Eurobench mod2am
Ping-pong | Sample ping-pong example
Linpack   | HPC Challenge benchmark
FT        | NAS NPB FT benchmark

Page 52: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 52

Mod2am: Original MPI code

for( i = 0; i < processes; i++ ) {
   stag = i + 1;
   rtag = stag;
   indx = (me + nodes - i )%processes;
   shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
   size = vsize*l;
   if(i%2 == 0) {
      mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
      MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats );
   } else {
      mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) );
      MPI_Sendrecv (rbuf, size, MPI_DOUBLE, down, stag, a, size, MPI_DOUBLE, up, rtag, comm, &stats );
   }
}


void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
{
   double alpha=1.0, beta=1.0;
   int i, j;
   char tr = 't';
   dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);
}

Page 53: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 53

for( i = 0; i < processes; i++ ) {
   stag = i + 1;
   rtag = stag;
   indx = (me + nodes - i )%processes;
   shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
   size = vsize*l;
   if(i%2 == 0) {
      mxm( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
      callSendRecv (a, size, down, stag, rbuf, up, rtag);
   } else {
      mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) );
      callSendRecv (rbuf, size, down, stag, a, up, rtag);
   }
}

Mod2am: Overlap communication and computation with StarSs

#pragma css task input (size, down, stag, up, rtag, a[size]) output (rbuf[size])
void callSendRecv (double *a, int size, int down, int stag, double *rbuf, int up, int rtag)
{
   MPI_Status stats;
   MPI_Sendrecv( a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, MPI_COMM_WORLD, &stats );
}

#pragma css task input( lda, m, l, n, a[m*l], b[l*n]) inout(c[m*n])
void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
{
   double alpha=1.0, beta=1.0;
   int i, j;
   char tr = 't';
   dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);
}


Parallelism coming from renaming!

Page 54: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 54

Eurobench mod2am in StarSs

for( i = 0; i < processes; i++ ) {
   stag = i + 1;
   rtag = stag;
   indx = (me + nodes - i )%processes;
   shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
   size = vsize*l;

   mxm (lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
   callSendRecv (a, size, down, stag, a, up, rtag);
}

#pragma css task input( lda, m, l, n, a[m*l], b[l*n]) inout(c[m*n])
void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
{
   double alpha=1.0, beta=1.0;
   int i, j;
   char tr = 't';
   dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);
}

#pragma css task input (size, down, stag, up, rtag, a[size]) output (rbuf[size])
void callSendRecv (double *a, int size, int down, int stag, double *rbuf, int up, int rtag)
{
   MPI_Status stats;
   MPI_Sendrecv( a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, MPI_COMM_WORLD, &stats );
}


Page 55: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 55

Communication thread for MPI calls

[Node figure: each node (several CPUs sharing memory) runs computation SMPSs threads and a communication SMPSs thread, each with its own ready-to-run queues for computation and communication tasks.]

• 1 MPI process per node(N cores)

• N computation threads per node

• 1 communication thread per node

• Communication and computation threads share CPU time

• The communication thread should wake up as soon as possible and should not waste CPU time while it is blocked

• Time sharing optimization by setting OS thread priorities and MPI waiting mode

Page 56: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 56

OS thread priorities & MPI waiting mode

[Timeline figure: the computation SMPSs threads use ~100% of the CPU; the communication SMPSs thread uses ~0% of the CPU while blocked, with short bursts of 100% CPU use when an MPI request is issued (~ms) and when a communication ends (~µs). MPI calls use blocking mode, so the communication thread does not waste resources while it is blocked.]

Reducing the priority of the computation threads accelerates execution of the communication thread
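A minimal Linux-specific sketch of the priority part of this optimization (illustrative only; this is not the SMPSs runtime code):

#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lower the calling (computation) thread's scheduling priority so that the
   communication thread gets the CPU as soon as its blocking MPI call returns.
   Linux allows setpriority() to be applied to a single thread id.            */
static void lower_computation_thread_priority(int nice_value)
{
    pid_t tid = (pid_t) syscall(SYS_gettid);
    setpriority(PRIO_PROCESS, tid, nice_value);   /* e.g. nice_value = 10 */
}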

Page 57: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 57

Hierarchy: Hybrid MPI/SMPSs

• Linpack example

• Overlap communication/computation

• Extend asynchronous data-flow execution to outer level

• Automatic lookahead

for (k=0; k<N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send (A[k]);
   } else {
      receive (A[k]);
      if (necessary) resend (A[k]);
   }
   for (j=k+1; j<N; j++)
      update (A[k], A[j]);
}

#pragma css task inout(A[SIZE])

void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE]) target(comm_thread)

void send(float *A);

#pragma css task output(A[SIZE]) target (comm_thread)

void receive(float *A);

#pragma css task input(A[SIZE]) target (comm_thread)
void resend(float *A);


Page 58: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 58

MPI/SMPSs LINPACK

#pragma css task input(buf) output(req) target(comm_thread)
void send (<type>buf[N*nb], MPI_Request *req)

#pragma css task output(buf, req) target(comm_thread)
void recv (<type>buf[count], MPI_Request *req)

#pragma css task input(A) output(panel) highpriority
void fact (double A[N/P][NB], double tmp_panel[N/P][NB]);

#pragma css task input(panel) inout(A)
void update (double tmp_panel[N/P][NB], double A[N/P][NB]);

#define NPANELS (N/NB*Q)
#define root (j%Q==my_rank)

double A[N/P][NPANELS*NB];
double tmp_panel[N/P][NB];
int k=0;

for( j = 0; j < N; j += NB ){
   if (root){
      fact (&A[j][j], tmp_panel);
      send (buf, req_send);
      k++;
   }else{
      recv (buf, req_recv);
      if (necessary) send (buf, req_forward);
   }
   for(i = k; i< NPANELS; i++ )
      update (tmp_panel, &A[j][i*NB], k);
}

The factorization phase contains fine-grain communications and is performed by a single task on the computation threads; its MPI waiting mode is “polling”.

• Keeps the same code structure as the pure MPI version, without a look-ahead technique

• Naturally follows the critical path in a data-flow manner

• Automatically overlaps communication and computation

Page 59: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 59

MPI/SMPSs LINPACK

#pragma css task input(buf) output(req) target(comm_thread)
void send (<type>buf[N*nb], MPI_Request *req)

#pragma css task output(buf, req) target(comm_thread)
void recv (<type>buf[count], MPI_Request *req)

#pragma css task input(A) output(panel) highpriority
void fact (double A[N/P][NB], double tmp_panel[N/P][NB]);

#pragma css task input(panel) inout(A)
void update (double tmp_panel[N/P][NB], double A[N/P][NB]);

#define NPANELS (N/NB*Q)
#define root (j%Q==my_rank)

double A[N/P][NPANELS*NB];
double tmp_panel[N/P][NB];
int k=0;

for( j = 0; j < N; j += NB ){
   if (root){
      fact (&A[j][j], tmp_panel);
      send (buf, req_send);
      k++;
   }else{
      recv (buf, req_recv);
      if (necessary) send (buf, req_forward);
   }
   for(i = k; i< NPANELS; i++ )
      update (tmp_panel, &A[j][i*NB], k);
}

1 MPI process, 4 computation threads and 1 communication thread.

The factorization phase uses ~99% of the CPU time (polling mode); the broadcast phase uses < 5% of the CPU time (blocking mode).

Page 60: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 60

MPI + SMPSs: the linpack example

Page 61: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 61

do iter = 1, niter

call evolve(u0, u1, twiddle, dims(1,1), dims(2,1), dims(3,1))

call fft(-1, u1, u2)

call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1))

end do

NAS FT Code


subroutine fft(dir, x1, x2)
...
if (dir .eq. 1) then
  ...
else
  ...
  else if (layout_type .eq. layout_1d) then
    call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), x1, x_help)
    l1=3
    call transpose2_local (dims(1,l1), dims(2,l1)*dims(3,l1), x_help, x2)
    call transpose2_global (dims(1,l1)*dims(2,l1), dims(3,l1), x2, x_help)
    call transpose2_finish (dims(1,l1), dims(2,l1)*dims(3,l1), x_help, x2)
    call cffts2 (-1, dims(1,2), dims(2,2), dims(3,2), x2, x_help)
    call cffts1 (-1, dims(1,1), dims(2,1), dims(3,1), x_help, x2)
  else if ...
  ...
  return
end


Page 62: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 62

!$css task

subroutine evolve(u0, u1, twiddle, d1, d2, d3)

implicit none

integer, intent(in) :: d1, d2, d3

double complex, intent(in) :: twiddle(d1*d2*d3)

double complex, intent(inout) :: u0(d1*d2*d3)

double complex, intent(out) :: u1(d1*d2*d3)

end subroutine

Tasks


!$css task
subroutine checksum(i, u1, d1, d2, d3)
implicit none
integer, intent(in) :: i, d1, d2, d3
double complex, intent(in) :: u1(d1*d2*d3)
end subroutine

!$css task HIGHPRIORITY()

subroutine cffts1(is, d1, d2, d3, x, xout)

implicit none

integer, intent(in) :: is, d1, d2, d3

double complex, intent(in) :: x(d1*d2*d3)

double complex, intent(out) :: xout(d1*d2*d3)

end subroutine

Page 63: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 63

Tasks


!$css task HIGHPRIORITY()
subroutine cffts2(is, d1, d2, d3, x, xout)
implicit none
integer, intent(in) :: is, d1, d2, d3
double complex, intent(in) :: x(d1*d2*d3)
double complex, intent(out) :: xout(d1*d2*d3)
end subroutine

!$css task HIGHPRIORITY()

subroutine transpose2_local(n1, n2, xin, xout)

implicit none

integer, intent(in) :: n1, n2

double complex, intent(inout) :: xin(n1*n2)

double complex, intent(inout) :: xout(n1*n2)

end subroutine

!$css task TARGET(COMM_THREAD)
subroutine transpose2_global(n1, n2, xin, xout)
implicit none
integer, intent(in) :: n1, n2
double complex, intent(inout) :: xin(n1*n2)
double complex, intent(inout) :: xout(n1*n2)
end subroutine

!$css task HIGHPRIORITY()

subroutine transpose2_finish(n1, n2, xin, xout)

implicit none

integer, intent(in) :: n1, n2

double complex, intent(inout) :: xin(n1*n2)

double complex, intent(inout) :: xout(n1*n2)

end subroutine

Page 64: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 64

Task graph

[Task graph, one iteration per process (P0 … PN-1): evolve → cffts1 → transpose2_local → transpose2_global (MPI_all_to_all across processes) → transpose2_finish → cffts2 → cffts1 → checksum.]

Loop (one iteration per process)

Page 65: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 65

Task graph

[Task graph across several iterations: the per-process chains (evolve → cffts1 → transpose2_local → transpose2_global / MPI_all_to_all → transpose2_finish → cffts2 → cffts1 → checksum) of consecutive iterations overlap, driven only by data dependences.]

Several iterations

Page 66: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 66

Performance analysis with Paraver

• Paraver traces are generated both for pure SMPSs and for MPI/SMPSs

• For pure SMPSs:

• Simply compile with “-t” option

• Run as usual, the paraver file is generated

• For MPI/SMPSs applications

• Compile with “-t” option

• Set the LD_PRELOAD variable

export LD_PRELOAD=$MPItrace_PATH/libsmpssmpitrace.so

• More details in the SMPSs manual

Page 67: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 67

Performance analysis

• SMPSs comes with a set of Paraver configuration files

• $(SMPSs)/share/cellss/paraver_cfgs/

• Also, Paraver distribution comes with a set of Paraver configuration files, for example for specific MPI analysis

• Some available configuration files:

• task.cfg: view of the tasks, different color means different task type

  • execution_phases.cfg: shows the activity of threads, i.e., the percentage of time in each different phase (adding a task, executing a task, ...)

• task_number.cfg: shows id number of the tasks being executed

  • flushing.cfg: shows the periods when the application is writing the tracefile, i.e., periods in which the tracefile is not significant

  • task_profile.cfg: classification of tasks versus threads

  • 3dh_duration_tasks.cfg: task duration histogram; can be viewed by task type

Page 68: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 68

Performance analysis

Page 69: Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 69

Conclusion

• StarSs is an active research project

• With a global view on multicore and parallel programming

• Stable enough to start porting relevant production codes at BSC

• With a lot of potential still to be investigated.