Programming with StarSs
Rosa M. Badia
Computer Sciences Research Dept.
BSC
Rosa M. Badia, . StarSs tutorial @ BSC. October 2010 2
Agenda
• StarSs overview
• SMPSs
• SMPSs examples
• Single node hands-on
• Hybrid model MPI/SMPSs
• Programming examples
• MPI/SMPSs hands-on
• Slides available in:
  • http://marsa.ac.upc.edu/prace/course_material_starss
  • /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial_PRACE_starss.ppt
StarSs overview
StarSs
• StarSs
  – A “node” level programming model
  – C/Fortran + directives
  – Nicely integrates in hybrid MPI/StarSs
  – Natural support for heterogeneity
• Programmability
  – Incremental parallelization/restructure
  – Abstract/separate algorithmic issues from resources
  – Disciplined programming
• Portability
  – “Same” source code runs on “any” machine
    • Optimized task implementations will result in better performance.
  – “Single source” for the maintained version of an application
• Performance
  – Asynchronous (data-flow) execution and locality awareness
  – Intelligent runtime: specific for each type of target platform.
    • Automatically extracts and exploits parallelism
    • Matches computations to resources
• StarSs family: CellSs, SMPSs, GPUSs, GridSs, ClearSpeedSs, ClusterSs, CompSs (Java)
History / Strategy
• PERMAS ~1994
• NANOS ~1996
• GridSs ~2002
• CellSs ~2006
• SMPSs V1 ~2007
• CompSs ~2007
• NANOS++ ~2008
• SMPSs V2 ~2009
• GPUSs ~2009
StarSs: a sequential program …
void vadd3 (float A[BS], float B[BS], float C[BS]);
void scale_add (float sum, float A[BS], float B[BS]);
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
    vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
    accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
    scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
    vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
    vadd3 (&G[i], &F[i], &E[i]);
StarSs: … taskified …
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
    vadd3 (&A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
    accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
    scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
    vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
    vadd3 (&G[i], &F[i], &E[i]);
[Figure: task dependence graph with tasks numbered 1 to 20. Color/number: order of task instantiation. Some antidependences covered by flow dependences are not drawn.]
Dependences are computed at task instantiation time.
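A minimal sketch of what "computing dependences at task instantiation time" can look like, under stated assumptions: the `graph_t` structure, its fixed limits and the functions `graph_init`, `find_slot` and `add_task` are hypothetical names for illustration, not the SMPSs runtime's data structures. Only flow (read-after-write) dependences are recorded; antidependences are assumed to be handled separately, for instance by renaming.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DATA  16   /* distinct data blocks tracked (illustrative limit) */
#define MAX_TASKS 32
#define MAX_DEPS  8

/* Hypothetical bookkeeping, not the real SMPSs runtime structures. */
typedef struct {
    const void *addr[MAX_DATA];     /* block addresses seen so far */
    int last_writer[MAX_DATA];      /* last task that wrote the block, -1 if none */
    int n_data;
    int deps[MAX_TASKS][MAX_DEPS];  /* deps[t]: tasks that task t must wait for */
    int n_deps[MAX_TASKS];
    int n_tasks;
} graph_t;

void graph_init(graph_t *g) {
    g->n_data = 0;
    g->n_tasks = 0;
}

/* Return the table slot for address p, creating it on first use. */
int find_slot(graph_t *g, const void *p) {
    for (int i = 0; i < g->n_data; i++)
        if (g->addr[i] == p) return i;
    assert(g->n_data < MAX_DATA);
    g->addr[g->n_data] = p;
    g->last_writer[g->n_data] = -1;
    return g->n_data++;
}

/* Called when a task is instantiated: inputs are matched against the
 * last writer of each block, then the outputs are recorded.
 * An inout parameter is passed in both lists. */
int add_task(graph_t *g, const void **in, int n_in,
             const void **out, int n_out) {
    assert(g->n_tasks < MAX_TASKS);
    int id = g->n_tasks++;
    g->n_deps[id] = 0;
    for (int i = 0; i < n_in; i++) {
        int w = g->last_writer[find_slot(g, in[i])];
        if (w < 0) continue;               /* data comes from the main program */
        int seen = 0;
        for (int d = 0; d < g->n_deps[id]; d++)
            if (g->deps[id][d] == w) seen = 1;
        if (!seen) {
            assert(g->n_deps[id] < MAX_DEPS);
            g->deps[id][g->n_deps[id]++] = w;
        }
    }
    for (int i = 0; i < n_out; i++)
        g->last_writer[find_slot(g, out[i])] = id;
    return id;
}
```

Registering a `vadd3` instance that writes a block of C and then an `accum` instance that reads the same block makes the second task depend on the first, which is the kind of edge the graph above draws.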
Decouple how we write from how it is executed
StarSs: … and executed in a data-flow model
[Figure: the same task dependence graph, with tasks colored and numbered by a possible order of task execution (steps 1 to 8).]
StarSs: the potential of data access information
• Flat global address space seen by programmer
• Flexibility to dynamically traverse dataflow graph “optimizing”
• Concurrency. Critical path
• Memory access: data transfers performed by run time
• Opportunities for
• Prefetch
• Reuse
• Eliminate antidependences (rename)
• Replication management
• Coherency/consistency handled by the runtime
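As an illustration of the "eliminate antidependences (rename)" bullet, here is a small sketch assuming a hypothetical `block_t` handle and the helper names `bind_input` and `bind_output_renamed`; none of these come from the StarSs runtime. A task that only writes a block receives a fresh buffer, so it does not have to wait for earlier tasks that are still reading the previous version.

```c
#include <assert.h>
#include <stdlib.h>

#define BS 4   /* block size, illustrative */

/* Hypothetical handle: tracks the current version of one data block. */
typedef struct {
    float *current;
} block_t;

/* A reader binds to the version that is current when it is instantiated. */
const float *bind_input(const block_t *b) {
    return b->current;
}

/* A pure-output writer gets a fresh buffer (a "rename") instead of
 * waiting for pending readers of the old version. A real runtime would
 * also release the old version once its last reader has finished. */
float *bind_output_renamed(block_t *b) {
    float *fresh = malloc(BS * sizeof *fresh);
    assert(fresh != NULL);
    b->current = fresh;   /* tasks instantiated later see the new version */
    return fresh;
}
```

In this sketch the earlier reader keeps a pointer to the old buffer and the later writer fills the new one, so the two can run in either order.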
StarSs: … reductions
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum) reduction(sum)
void accum (float A[BS], float *sum);
[Figure: the same task dependence graph with the accum tasks declared as a reduction. Color/number: possible order of task execution (steps 1 to 6).]
StarSs: just a few directives
#pragma css task [input ( parameters ) ] \
[output ( parameters ) ] \
[inout ( parameters )] \
[target device( [cell, smp, cuda] ) ] \
[implements ( task_name ) ] \
[reduction ( parameters ) ] \
[ highpriority ]
#pragma css wait on ( data_address )
#pragma css barrier
#pragma css mutex lock ( variable )
#pragma css mutex unlock( variable )
parameters: parameter [ , parameter ]*
parameter: variable_name {[dimension]}*
StarSs Syntax
• Examples: task selection
#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[size][size], B[size][size], size) inout(C[size][size])
void block_addmultiply( float *C, float *A, float *B, int size ) { ...

Note: task annotations can be placed before the function code or before the function declaration in a header file.
• Examples: waiting for data
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (sq_error)
if (sq_error > 0.0000001) { ...
StarSs Syntax
• Examples: reductions
#pragma css task input (vec, n) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results)
{
    int i;
    int local_sol=0;

    for (i = 0; i < n; i++)
        local_sol += vec[i];
#pragma css mutex lock (results)
    *results = *results + local_sol;
#pragma css mutex unlock (results)
}

• Examples: reductions. The parameter used in the mutex can be anything that identifies the access uniquely

#pragma css task input (vec, n, id) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results, int id)
{
    int i;
    int local_sol=0;

    for (i = 0; i < n; i++)
        local_sol += vec[i];
#pragma css mutex lock (id)
    *results = *results + local_sol;
#pragma css mutex unlock (id)
}
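The reduction pattern above (accumulate privately, then update the shared result inside a short critical section) can be sketched in plain C. This is an illustration under assumptions: a POSIX mutex stands in for `#pragma css mutex lock/unlock`, the driver `sum_blocks` is a hypothetical helper, and the tasks are simply called one after another here, whereas SMPSs could run them concurrently.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the lock named in "#pragma css mutex lock (results)". */
static pthread_mutex_t results_lock = PTHREAD_MUTEX_INITIALIZER;

/* Same structure as sum_task: the loop touches only local data, and
 * the shared accumulator is updated once, under the lock. */
void sum_task(const int *vec, int n, int *results) {
    assert(n >= 0);
    int local_sol = 0;
    for (int i = 0; i < n; i++)
        local_sol += vec[i];
    pthread_mutex_lock(&results_lock);
    *results += local_sol;
    pthread_mutex_unlock(&results_lock);
}

/* Hypothetical driver: one task instance per block of bs elements. */
int sum_blocks(const int *vec, int n, int bs) {
    assert(bs > 0);
    int total = 0;
    for (int i = 0; i < n; i += bs) {
        int len = (n - i < bs) ? n - i : bs;
        sum_task(&vec[i], len, &total);
    }
    return total;
}
```

Keeping the locked region down to a single addition is what lets reduction tasks overlap almost entirely.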
StarSs Syntax
• MPI tasks

#pragma css task input(partner, bufsend) output(bufrecv) target(comm_thread)
void communication(int partner, double bufsend[BLOCK_SIZE], double bufrecv[BLOCK_SIZE])
{
    int ierr;
    MPI_Request request[2];
    MPI_Status status[2];

    ierr = MPI_Isend(bufsend, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[0]);
    ierr = MPI_Irecv(bufrecv, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[1]);
    MPI_Waitall(2, request, status);
}
StarSs Syntax in Fortran
subroutine example()
  ...
  interface
    !$CSS TASK
    subroutine block_add_multiply(C, A, B, BS)
      implicit none
      integer, intent (in) :: BS
      real, intent (in) :: A(BS,BS), B(BS,BS)
      real, intent (inout) :: C(BS,BS)
    end subroutine
  end interface
  ...
  !$CSS START
  ...
  call block_add_multiply(C, A, B, BLOCK_SIZE)
  ...
  !$CSS FINISH
  ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
  ...
end subroutine
StarSs Syntax in Fortran: with MPI
subroutine example()
  ...
  interface
    !$css task TARGET(COMM_THREAD)
    subroutine checksum(i, u1, d1, d2, d3)
      implicit none
      integer, intent(in) :: i, d1, d2, d3
      double complex, intent(in) :: u1(d1*d2*d3)
    end subroutine
  end interface
  ...
SMPSs
SMPSs implementation
[Figure: SMPSs runtime architecture, drawn next to a processor pipeline (IFU, DEC, REN, IQ, ISS, REG, FU, RET). The main thread (CPU0) runs the user main program and the SMPSs runtime library: data dependence analysis, data renaming (renaming table), scheduling and task execution. Worker threads 1 and 2 (CPU1, CPU2) run the SMPSs runtime library (scheduling, task execution) and the original task code. Global ready task queues (high priority, low priority) and one ready task queue per thread, with work stealing between threads. All threads share memory.]
J.M. Perez, et al, “A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures” Cluster 2008
SMPSs: Compiler phase
[Figure: compiler phase of SMPSS-CC. Labels: app.c, code translation (mcc), smpss-cc_app.c, C compiler (gcc, icc, ...), smpss-cc_app.o, app.tasks (tasks list), pack, app.o.]
SMPSs: Linker phase
[Figure: linker phase of SMPSS-CC. Labels: app.o, unpack, smpss-cc_app.o, app.tasks, glue code generator, exec-adapters.c, exec-registration.c, C compiler (gcc, icc, ...), exec-adapters.o, exec-registration.o, libSMPSS.so, linker, exec.]
SMPSs Programming examples
Programming examples
Kernel | Explanation | Specific features
Simple example | Initial simple codes | Basic StarSs features
MxM | Matrix multiply | Basic features; CellSs, GPUSs implementations
Cholesky | Cholesky factorization | Irregular graph
N Queens | Solution of N Queens | Recursion. Memory management. Granularity
Stream | Memory bandwidth stress | Use of barrier
Argon | Simple MD simulation | Example in Fortran
Simple examples
Where the information on size and address comes from:

#pragma css task input(A, B) inout(C)
void dgemm( float A[N][N], float B[N][N], float C[N][N] ) { ... }

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void dgemm( float *A, float *B, float *C ) { ... }

Barriers for timing purposes:

#pragma css task input(A, B) inout(C)
void dgemm( float A[N][N], float B[N][N], float C[N][N] ) { ... }

main() {
    ...
#pragma css barrier
    t = time();
    dgemm(A,B,C);
    dgemm(D,E,F);
    dgemm(C,F,G);
    dgemm(C,E,H);
#pragma css barrier
    t = time() - t;
    printf ("result G = %f; time = %d\n", G[0][0], t);
}

[Figure: dependence graph of the four dgemm tasks, numbered 1 to 4.]
Simple examples
#pragma css task input(A, B) inout(C)
void dgemm( float A[N][N], float B[N][N], float C[N][N] ) { ... }

main() {
    ...
    dgemm(A,B,C);
    dgemm(D,E,F);
    dgemm(C,F,G);
    dgemm(A,D,H);
    dgemm(C,H,I);
#pragma css wait on (F)
    printf ("result F = %f\n", F[0][0]);
    dgemm(H,G,C);
#pragma css barrier
    printf ("result C = %f\n", C[0][0]);
}

[Figure: dependence graph of the six dgemm tasks, numbered 1 to 6.]
MxM on matrix stored by blocks
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                mm_tile( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;

    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

• Will work on matrices of any size
• Will work on any number of cores/devices

[Figure: matrix stored as NB × NB blocks, each block of BS × BS elements.]
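To make the block storage concrete, the following plain-C sketch runs the same triple loop over tiles sequentially. The sizes `NB = 2` and `BS = 2`, the `tile_t` typedef and the helpers `at` and `mm_blocked` are assumptions for illustration; in SMPSs each `mm_tile` call would be a task instance.

```c
#include <assert.h>

#define NB 2            /* blocks per dimension (illustrative) */
#define BS 2            /* elements per block dimension (illustrative) */
#define N  (NB * BS)

typedef float tile_t[BS][BS];

/* Address of element (r, c) of a matrix stored by blocks. */
float *at(tile_t M[NB][NB], int r, int c) {
    assert(r >= 0 && r < N && c >= 0 && c < N);
    return &M[r / BS][c / BS][r % BS][c % BS];
}

/* Same body as the mm_tile task: C += A * B on one tile. */
void mm_tile(tile_t C, tile_t A, tile_t B) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Sequential driver with the loop structure of main() above. */
void mm_blocked(tile_t C[NB][NB], tile_t A[NB][NB], tile_t B[NB][NB]) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                mm_tile(C[i][j], A[i][k], B[k][j]);
}
```

For a fixed `(i, j)` the `k` iterations all update the same tile of C, so they form a chain through the `inout(C)` parameter, while different `(i, j)` tiles are independent of each other.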
MxM @ CellSs
[Figure: matrix stored as NB × NB blocks, each block of BS × BS elements.]

int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                dgem_64x64( C[i][j], A[i][k], B[k][j]);
}
#define StageCBA(OFFSET,OFFB) \{ \ ALIGN8B; \ SPU_FMA(c0_0B,a00,b0_0B,c0_0B); c0_0C = *((volatile vector float *)(ptrC+OFFSET+16)); \ SPU_FMA(c1_0B,a10,b0_0B,c1_0B); c1_0C = *((volatile vector float *)(ptrC+M+OFFSET+16)); \ SPU_FMA(c2_0B,a20,b0_0B,c2_0B); c2_0C = *((volatile vector float *)(ptrC+2*M+OFFSET+16)); \ SPU_FMA(c3_0B,a30,b0_0B,c3_0B); SPU_LNOP; \ SPU_FMA(c0_1B,a00,b0_1B,c0_1B); c3_0C = *((volatile vector float *)(ptrC+3*M+OFFSET+16)); \ SPU_FMA(c1_1B,a10,b0_1B,c1_1B); c0_1C = *((volatile vector float *)(ptrC+OFFSET+20)); \ SPU_FMA(c2_1B,a20,b0_1B,c2_1B); c1_1C = *((volatile vector float *)(ptrC+M+OFFSET+20)); \ SPU_FMA(c3_1B,a30,b0_1B,c3_1B); SPU_LNOP; \ SPU_FMA(c0_0B,a01,b1_0B,c0_0B); c2_1C = *((volatile vector float *)(ptrC+2*M+OFFSET+20)); \…….
static void dgem_64x64(volatile float *blkC, volatile float *blkA, volatile float *blkB){ unsigned int i; volatile float *ptrA, *ptrB, *ptrC;
vector float a0, a1, a2, a3;
vector float a00, a01, a02, a03; vector float a10, a11, a12, a13;…..for(i=0; i<M; i+=4){ ptrA = &blkA[i*M]; ptrB = &blkB[0]; ptrC = &blkC[i*M]; a0 = *((volatile vector float *)(ptrA)); a1 = *((volatile vector float *)(ptrA+M)); a2 = *((volatile vector float *)(ptrA+2*M)); a3 = *((volatile vector float *)(ptrA+3*M)); a00 = spu_shuffle(a0, a0, pat0); a01 = spu_shuffle(a0, a0, pat1); a02 = spu_shuffle(a0, a0, pat2); a03 = spu_shuffle(a0, a0, pat3); a10 = spu_shuffle(a1, a1, pat0); a11 = spu_shuffle(a1, a1, pat1);…… a33 = spu_shuffle(a3, a3, pat3); Loads4RegSetA(0); Ops4RegSetA(); Loads4RegSetB(8); StageCBA(0,0); StageACB(8,0); StageBAC(16,0); StageCBA(24,0); StageACB(32,0); StageBAC(40,0); StageMISC(0,0); StageCBAmod(0,4); StageACB(8,4); StageBAC(16,4); StageCBA(24,4); StageACB(32,4); StageBAC(40,4); StageMISC(4,4); StageCBAmod(0,8); StageACB(8,8);…..
#ifdef SPU_CODE
#include <blas_s.h>
#endif

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void dgemm_64x64(double *C, double *A, double *B);

Leverage existing kernels, libraries, …
MxM @ GPUSs using CUBLAS kernel
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                mm_tile( C[i][j], A[i][k], B[k][j], BS);
}

#pragma css task input(A[NB][NB], B[NB][NB], NB) \
                 inout(C[NB][NB]) target device(cuda)
void mm_tile (float *A, float *B, float *C, int NB)
{
    unsigned char TR = 'T', NT = 'N';
    float DONE = 1.0, DMONE = -1.0;
    float *d_A, *d_B, *d_C;

    cublasSgemm (NT, NT, NB, NB, NB, DMONE, A, NB, B, NB, DONE, C, NB);
}

[Figure: matrix stored as NB × NB blocks, each block of BS × BS elements.]
Cholesky factorization
• Cholesky factorization
• Common matrix operation used to solve normal equations in linear least squares problems.
• Calculates a triangular matrix (L) from a symmetric positive definite matrix A.
  Cholesky(A) = L
  L · L^t = A
• Different possible implementations, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
• It can be decomposed in block operations
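For reference, a scalar (unblocked) sketch of the factorization, written column by column. It is an illustration only: `DIM = 3` is an arbitrary size, and `sqrt_newton` is a hypothetical helper used just to keep the example free of external math libraries. In a blocked version each scalar operation becomes an operation on a whole tile.

```c
#include <assert.h>

#define DIM 3   /* matrix order, illustrative */

/* Newton iteration for the square root (helper for this sketch only). */
double sqrt_newton(double x) {
    assert(x > 0.0);
    double r = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Overwrites the lower triangle of a with L such that L * L^t = A.
 * The strict upper triangle is left untouched. */
void cholesky(double a[DIM][DIM]) {
    for (int j = 0; j < DIM; j++) {
        /* Diagonal: a[j][j] -= sum of squares of row j, then square root. */
        for (int k = 0; k < j; k++)
            a[j][j] -= a[j][k] * a[j][k];
        a[j][j] = sqrt_newton(a[j][j]);
        /* Column below the diagonal: update, then triangular solve. */
        for (int i = j + 1; i < DIM; i++) {
            for (int k = 0; k < j; k++)
                a[i][j] -= a[i][k] * a[j][k];
            a[i][j] /= a[j][j];
        }
    }
}
```

The three steps inside the loop (update the diagonal, factor it, update and solve the column) are the ones a blocked version performs tile by tile.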
Cholesky factorization
• In each iteration red and blue blocks are updated
• SPOTRF: Computes the Cholesky factorization of the diagonal block
• STRSM: Computes the column panel
• SSYRK: Computes the row panel
• SGEMM: Updates the rest of the matrix
Cholesky factorization
main (){
    ...
    for (int j = 0; j < DIM; j++){
        for (int k = 0; k < j; k++){
            for (int i = j+1; i < DIM; i++){
                // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
                css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
        }
        for (int i = 0; i < j; i++){
            // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
            css_ssyrk_tile( A[j][i], A[j][j] );
        }
        // Cholesky Factorization of A[j,j]
        css_spotrf_tile( A[j][j] );
        for (int i = j+1; i < DIM; i++){
            // A[i,j] <- A[i,j] = X * (A[j,j])^t
            css_strsm_tile( A[j][j], A[i][j] );
        }
    }
    ...
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
#pragma css wait on (A[i][j])
            print_block(A[i][j]);
        }
    }
    ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)

[Figure: matrix of DIM × DIM blocks, each block of 64 × 64 elements.]
N Queens
• Compute the number of solutions to the problem of placing N queens on an N × N board so that no queen attacks any other

N          4   5   6   7   8    9   10  ...  14
Solutions  2  10   4  40  92  352  724  ...  365596

[Figure: example board for N = 8 with the queens placed at 2 4 6 8 3 1 7 5, one per position j.]
N Queens: sequential version

void nqueens(int n, int j, char *a, int *results)
{
    int i;

    if (n == j) {
        /* put good solution in heap, return pointer to it. */
        if (!find_solution) {
            find_solution = 1;
            memcpy(sol, a, n * sizeof(char));
        }
        (*results)++;
    } else {
        /* try each possible position for queen <j> */
        for (i = 0; i < n; i++) {
            a[j] = i;
            if (ok(j + 1, a)) {
                nqueens(n, j + 1, a, results);
            }
        }
    }
}

...
nqueens (n, 0, a, &total_res);
...

int ok(int n, char *a)
{
    int i, j;
    char p, q;

    for (i = 0; i < n; i++) {
        p = a[i];
        for (j = i + 1; j < n; j++) {
            q = a[j];
            if (q == p || q == p - (j - i) || q == p + (j - i))
                return 0;
        }
    }
    return 1;
}
Queens strategy in SMPSs
Solution
CUT level
Task
Sequential execution
N Queens
• Manual memory allocation

#pragma css task input (n, j, a[n]) inout (results) \
                 reduction (results)
void nqueens_ser_task(int n, int j, char *a, int *results)
{
    int i;
    int local_sols=0;

    if (n == j) {
#pragma css mutex lock (&find_solution)
        if (!find_solution) {
            find_solution = 1;
            memcpy(sol, a, n * sizeof(char));
        }
#pragma css mutex unlock (&find_solution)
        local_sols++;
    } else {
        for (i = 0; i < n; i++) {
            a[j] = i;
            if (ok(j + 1, a)) {
                nqueens_ser_task(n, j + 1, a, &local_sols);
            }
        }
    }
#pragma css mutex lock (results)
    *results = *results + local_sols;
#pragma css mutex unlock (results)
}

void nqueens(int n, int j, char *a, int depth)
{
    int i;
    char *b;

    for (i = 0; i < n; i++) {
        a[j] = i;
        if (ok(j + 1, a)) {
            if (depth < task_depth) {
                nqueens(n, j + 1, a, depth + 1);
            } else {
                /* each task gets its own copy of the partial board a[0..j] */
                b = malloc (BOARD_SIZE * sizeof(char));
                memcpy (b, a, (j + 1) * sizeof(char));
                nqueens_ser_task(n, j + 1, b, &total_res);
            }
        }
    }
}
N Queens
• Automatic memory allocation: the runtime renames the array

void nqueens(int n, int j, char *a, char *b, int depth)
{
    int i;

    for (i = 0; i < n; i++) {
        a[j] = i;
        if (ok(j + 1, a)) {
            add_queen_task(b, j, i, n);
            if (depth < task_depth) {
                nqueens(n, j + 1, a, b, depth + 1);
            } else {
                nqueens_ser_task(n, j + 1, b, &total_res);
            }
        }
    }
}

#pragma css task input (j, i, n) \
                 inout (a[n]) highpriority
void add_queen_task(char *a, int j, int i, int n)
{
    a[j] = i;
}
Stream
• Stream is part of the HPC Challenge benchmark suite
  http://icl.cs.utk.edu/hpcc/
• Stream is a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels.
• Original OpenMP version: inserts barriers after each operation has processed all elements of the array
• Two StarSs versions: with and without barriers
• Without barriers the correctness is guaranteed thanks to the data-dependence analysis
Copy: c = a
Scale: b = s*c
Add: c = a + b
Triad: a = b + s*c
Initialize: a, b, c
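A sequential sketch of one Stream iteration over blocks, to show the order in which the four kernels read and write the three arrays. The driver `stream_iteration` and the value of `BSIZE` are assumptions for illustration; the four block kernels mirror the tasks of the StarSs version.

```c
#include <assert.h>

#define BSIZE 4   /* elements per block, illustrative */

void copy_task(const double *a, double *c) {
    for (int j = 0; j < BSIZE; j++) c[j] = a[j];
}

void scale_task(double *b, const double *c, double scalar) {
    for (int j = 0; j < BSIZE; j++) b[j] = scalar * c[j];
}

void add_task(const double *a, const double *b, double *c) {
    for (int j = 0; j < BSIZE; j++) c[j] = a[j] + b[j];
}

void triad_task(double *a, const double *b, const double *c, double scalar) {
    for (int j = 0; j < BSIZE; j++) a[j] = b[j] + scalar * c[j];
}

/* Hypothetical driver: Copy, Scale, Add, Triad, each applied block by block. */
void stream_iteration(double *a, double *b, double *c, int n, double scalar) {
    assert(n % BSIZE == 0);
    for (int j = 0; j < n; j += BSIZE) copy_task(&a[j], &c[j]);
    for (int j = 0; j < n; j += BSIZE) scale_task(&b[j], &c[j], scalar);
    for (int j = 0; j < n; j += BSIZE) add_task(&a[j], &b[j], &c[j]);
    for (int j = 0; j < n; j += BSIZE) triad_task(&a[j], &b[j], &c[j], scalar);
}
```

Each kernel on block `j` only touches block `j` of the arrays, so a kernel on one block depends only on the earlier kernels on that same block. That is why dependence analysis alone can order the tasks correctly.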
Stream: mimicking original version
main ()
{
#pragma css start
    tuned_initialization();
#pragma css barrier
    scalar = 3.0;
    for (k=0; k<NTIMES; k++) {
        tuned_STREAM_Copy();
#pragma css barrier
        tuned_STREAM_Scale(scalar);
#pragma css barrier
        tuned_STREAM_Add();
#pragma css barrier
        tuned_STREAM_Triad(scalar);
#pragma css barrier
    }
#pragma css finish
}

#pragma css task input (a) output (c)
void copy_task(double a[BSIZE], double c[BSIZE])
{
    int j;
    for (j=0; j < BSIZE; j++)
        c[j] = a[j];
}

void tuned_STREAM_Copy()
{
    int j;
    for (j=0; j<N; j+=BSIZE)
        copy_task (&a[j], &c[j]);
}

#pragma css task input (c, scalar) output (b)
void scale_task (double b[BSIZE], double c[BSIZE], double scalar)
{
    int j;
    for (j=0; j < BSIZE; j++)
        b[j] = scalar*c[j];
}

void tuned_STREAM_Scale(double scalar)
{
    int j;
    for (j=0; j<N; j+=BSIZE)
        scale_task (&b[j], &c[j], scalar);
}
Stream
#pragma css task input (a, b) output (c)
void add_task (double a[BSIZE], double b[BSIZE], double c[BSIZE])
{
    int j;
    for (j=0; j < BSIZE; j++)
        c[j] = a[j]+b[j];
}

void tuned_STREAM_Add()
{
    int j;
    for (j=0; j<N; j+=BSIZE)
        add_task(&a[j], &b[j], &c[j]);
}

#pragma css task input (b, c, scalar) output (a)
void triad_task (double a[BSIZE], double b[BSIZE], double c[BSIZE], double scalar)
{
    int j;
    for (j=0; j < BSIZE; j++)
        a[j] = b[j]+scalar*c[j];
}

void tuned_STREAM_Triad(double scalar)
{
    int j;
    for (j=0; j<N; j+=BSIZE)
        triad_task (&a[j], &b[j], &c[j], scalar);
}
Stream: version without barriers
• Version without barriers: SMPSs/CellSs data dependence analysis guarantees correctness
main ()
{
#pragma css start
    tuned_initialization();
#pragma css barrier
    scalar = 3.0;
    for (k=0; k<NTIMES; k++) {
        tuned_STREAM_Copy();
        tuned_STREAM_Scale(scalar);
        tuned_STREAM_Add();
        tuned_STREAM_Triad(scalar);
    }
#pragma css finish
}

[Figure: task graph for four blocks. Init tasks 1-4, Copy tasks 5-8, Scale tasks 9-12, Add tasks 13-16, Triad tasks 17-20; edges link tasks of the same iteration and of the next iteration.]
Molecular dynamics: Argon
• Molecular dynamics: Argon simulation
• Simulates the mobility of Argon atoms in gas state, in a constant volume at T=300K
• All electrostatic forces observed for each of the atoms against the others are considered (F_i)
• Newton's second law is then applied to each atom: F_i = m·a_i
• The initial velocities are random but reasonable for argon atoms at 300K
• To maintain a constant temperature throughout the process, the Berendsen algorithm is applied
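The temperature control step can be written out as the two quantities that the `argon` program listing computes before calling `update_position`. The symbols map to its variables as follows: m is `mkg`, k_B is `kb`, T is the target temperature `t`, T_ins is `tins` and λ is `lam1`. How `update_position` applies λ is not shown in the slides, so this only restates the visible arithmetic.

```latex
T_{\mathrm{ins}} = \frac{1}{N} \sum_{i=1}^{N} \frac{m\, v_i^{2}}{3\, k_B},
\qquad
\lambda = \sqrt{\frac{T}{T_{\mathrm{ins}}}}
```

When the instantaneous temperature is above the target, λ is smaller than 1, and it is larger than 1 when the system is too cold.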
Molecular dynamics: Argon simulation
program argon
...
!$CSS START
  do step=1,niter
    do ii=1, N, BSIZE
      do jj=1, N, BSIZE
        call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
      enddo
    enddo
    do jj=1, N, BSIZE
      call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
    enddo
!$CSS BARRIER
    tins=0.e0
    do i=1,N
      tins=mkg*v(i)**2/3.e0/kb+tins
    enddo
    tins=tins/N
    lam1=sqrt(t/tins)
    do ii=1, N, BSIZE
      call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
    enddo
  enddo
!$CSS FINISH
end

program argon
...
  interface
    !$CSS TASK
    subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE, ii, jj
      real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
    end subroutine
    !$CSS TASK
    subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(in) :: lam1
      real, intent(inout), dimension(BSIZE) :: vx, vy, vz
      real, intent(inout), dimension(BSIZE) :: x, y, z
    end subroutine
    !$CSS TASK
    subroutine v_mod(BSIZE, v, vx, vy, vz)
      implicit none
      integer, intent(in) :: BSIZE
      real, intent(out) :: v(BSIZE)
      real, intent(in), dimension(BSIZE) :: vx, vy, vz
    end subroutine
  end interface
Molecular dynamics: Argon simulation
!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
  ! subroutine code
end subroutine
Hands-on
• Copy files from:
• /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial.tar.gz
Extensions to OpenMP
Extensions to OpenMP tasking
E. Ayguade, et al., “A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures”, LNCS, IWOMP 2009, IJPP 2010
• Dependences: not all arguments in directionality clause
• Different implementations
• Heterogeneous devices
• Separation dependences/transfers
Sparse LU (data by blocks)
Inline directives: save manual outlining. Tasks have no name, so multiple implementations are not possible.
Heterogeneous Cholesky
Annotated function declaration: ALL instances become tasks
Array sections
Hybrid MPI/SMPSs
A few hybrid examples
Kernel | Explanation
MxM | Eurobench mod2am
Ping-pong | Sample ping-pong example
Linpack | HPC Challenge benchmark
FT | NAS NPB FT benchmark
Mod2am: Original MPI code
for( i = 0; i < processes; i++ ) {
    stag = i + 1;
    rtag = stag;
    indx = (me + nodes - i )%processes;
    shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
    size = vsize*l;
    if(i%2 == 0) {
        mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
        MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats );
    } else {
        mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) );
        MPI_Sendrecv (rbuf, size, MPI_DOUBLE, down, stag, a, size, MPI_DOUBLE, up, rtag, comm, &stats );
    }
}

void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
{
    double alpha=1.0, beta=1.0;
    int i, j;
    char tr = 't';
    dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);
}

[Figure: matrices C, A, B and the receive buffer rbuf.]
Mod2am: Overlap communication and computation with StarSs

for( i = 0; i < processes; i++ ) {
    stag = i + 1;
    rtag = stag;
    indx = (me + nodes - i )%processes;
    shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
    size = vsize*l;
    if(i%2 == 0) {
        mxm( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
        callSendRecv (a, size, down, stag, rbuf, up, rtag);
    } else {
        mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) );
        callSendRecv (rbuf, size, down, stag, a, up, rtag);
    }
}

#pragma css task input (size, down, stag, up, rtag, a[size]) output (rbuf[size])
void callSendRecv (double *a, int size, int down, int stag, double *rbuf, int up, int rtag)
{
    MPI_Status stats;
    MPI_Sendrecv( a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, MPI_COMM_WORLD, &stats );
}

#pragma css task input( lda, m, l, n, a[m*l], b[l*n]) inout(c[m*n])
void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
{
    double alpha=1.0, beta=1.0;
    int i, j;
    char tr = 't';
    dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m);
}

[Figure: matrices C, A, B and the receive buffer rbuf.]
Parallelism coming from renaming!
Eurobench mod2am in StarSs
for( i = 0; i < processes; i++ ) {
    stag = i + 1;
    rtag = stag;
    indx = (me + nodes - i )%processes;
    shift = ((offset[indx][0]/BS)*nDIM*BS*BS);
    size = vsize*l;
    mxm (lda, sizes[indx][0], lda, hsize, a, b, (c+shift) );
    callSendRecv (a, size, down, stag, a, up, rtag);
}
Communication thread for MPI calls
[Figure: one node with four CPUs and shared memory. Computation SMPSs threads and one communication SMPSs thread take computation and communication SMPSs tasks from ready-to-run queues.]
• 1 MPI process per node (N cores)
• N computation threads per node
• 1 communication thread per node
• Communication and computation threads share CPU time
• The communication thread should wake up as soon as possible and should not waste CPU time while it is blocked
• Time sharing is optimized by setting OS thread priorities and the MPI waiting mode
OS thread priorities & MPI waiting mode
[Figure: CPU-time timeline for a computation SMPSs thread and the communication SMPSs thread. The computation thread uses 100% of the CPU while the communication thread is blocked (0% use of CPU). The communication thread uses 100% of the CPU at the MPI request (~ms) and at the end of communication (~us).]
• MPI calls use blocking mode, so the communication thread does not waste resources while it is blocked
• Reducing the priority of the computation threads accelerates execution of the communication thread
Hierarchy: Hybrid MPI/SMPSs
• Linpack example
• Overlap communication/computation
• Extend asynchronous data-flow execution to outer level
• Automatic lookahead
…
for (k=0; k<N; k++) {
    if (mine) {
        Factor_panel(A[k]);
        send (A[k]);
    } else {
        receive (A[k]);
        if (necessary) resend (A[k]);
    }
    for (j=k+1; j<N; j++)
        update (A[k], A[j]);
…

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE]) target(comm_thread)
void send(float *A);

#pragma css task output(A[SIZE]) target(comm_thread)
void receive(float *A);

#pragma css task input(A[SIZE]) target(comm_thread)
void resend(float *A);

[Figure: task graphs of processes P0, P1 and P2.]
MPI/SMPSs LINPACK
#pragma css task input(buf) output(req) target(comm_thread)
void send (<type> buf[N*nb], MPI_Request *req)

#pragma css task output(buf, req) target(comm_thread)
void recv (<type> buf[count], MPI_Request *req)

#pragma css task input(A) output(panel) highpriority
void fact (double A[N/P][NB], double tmp_panel[N/P][NB]);

#pragma css task input(panel) inout(A)
void update (double tmp_panel[N/P][NB], double A[N/P][NB]);

#define NPANELS (N/NB*Q)
#define root (j%Q==my_rank)

double A[N/P][NPANELS*NB];
double tmp_panel[N/P][NB];
int k=0;

for( j = 0; j < N; j += NB ){
    if (root){
        fact (&A[j][j], tmp_panel);
        send (buf, req_send);
        k++;
    } else {
        recv (buf, req_recv);
        if (necessary) send (buf, req_forward);
    }
    for(i = k; i < NPANELS; i++ )
        update (tmp_panel, &A[j][i*NB], k);
}

Note: the factorization phase contains fine-grain communications and is performed by a single task for computation threads; its MPI waiting mode is “polling”.

• Keeps the same code structure as the pure MPI version without the look-ahead technique
• Naturally follows the critical path in a data-flow way
• Automatically overlaps communication and computation
MPI/SMPSs LINPACK
• 1 MPI process: 4 computation threads and 1 communication thread
• Factorization phase uses 99% CPU time (polling mode)
• Broadcast phase uses < 5% CPU time (blocking mode)
MPI + SMPSs: the linpack example
NAS FT Code
do iter = 1, niter
   call evolve(u0, u1, twiddle, dims(1,1), dims(2,1), dims(3,1))
   call fft(-1, u1, u2)
   call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1))
end do
subroutine fft(dir, x1, x2)
...
if (dir .eq. 1) then
   ...
else
   ...
else if (layout_type .eq. layout_1d) then
   call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), x1, x_help)
   l1=3
   call transpose2_local (dims(1,l1), dims(2, l1)*dims(3, l1), x_help, x2)
   call transpose2_global (dims(1,l1)*dims(2, l1), dims(3, l1), x2, x_help)
   call transpose2_finish (dims(1,l1), dims(2, l1)*dims(3, l1), x_help, x2)
   call cffts2 (-1, dims(1,2), dims(2,2), dims(3,2), x2, x_help)
   call cffts1 (-1, dims(1,1), dims(2,1), dims(3,1), x_help, x2)
else if ...
   ...
return
end
[Slide annotations: the arguments, e.g. x_help, are marked as input, inout or output]
Tasks
!$css task
subroutine evolve(u0, u1, twiddle, d1, d2, d3)
implicit none
integer, intent(in) :: d1, d2, d3
double complex, intent(in) :: twiddle(d1*d2*d3)
double complex, intent(inout) :: u0(d1*d2*d3)
double complex, intent(out) :: u1(d1*d2*d3)
end subroutine
!$css task
subroutine checksum(i, u1, d1, d2, d3)
implicit none
integer, intent(in) :: i, d1, d2, d3
double complex, intent(in) :: u1(d1*d2*d3)
end subroutine
!$css task HIGHPRIORITY()
subroutine cffts1(is, d1, d2, d3, x, xout)
implicit none
integer, intent(in) :: is, d1, d2, d3
double complex, intent(in) :: x(d1*d2*d3)
double complex, intent(out) :: xout(d1*d2*d3)
end subroutine
Tasks
!$css task HIGHPRIORITY()
subroutine cffts2(is, d1, d2, d3, x, xout)
implicit none
integer, intent(in) :: is, d1, d2, d3
double complex, intent(in) :: x(d1*d2*d3)
double complex, intent(out) :: xout(d1*d2*d3)
end subroutine
!$css task HIGHPRIORITY()
subroutine transpose2_local(n1, n2, xin, xout)
implicit none
integer, intent(in) :: n1, n2
double complex, intent(inout) :: xin(n1*n2)
double complex, intent(inout) :: xout(n1*n2)
end subroutine
!$css task TARGET(COMM_THREAD)
subroutine transpose2_global(n1, n2, xin, xout)
implicit none
integer, intent(in) :: n1, n2
double complex, intent(inout) :: xin(n1*n2)
double complex, intent(inout) :: xout(n1*n2)
end subroutine
!$css task HIGHPRIORITY()
subroutine transpose2_finish(n1, n2, xin, xout)
implicit none
integer, intent(in) :: n1, n2
double complex, intent(inout) :: xin(n1*n2)
double complex, intent(inout) :: xout(n1*n2)
end subroutine
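In the Fortran interfaces above, the direction of each argument is given by its intent. For comparison with the C examples elsewhere in this tutorial, here is a hedged sketch of how two of these tasks might look in C, with intent(in), intent(inout) and intent(out) written as input, inout and output clauses. `NTOTAL` is a hypothetical compile-time stand-in for d1*d2*d3, the scalar dimension arguments are dropped, and the bodies are illustrative placeholders rather than the NAS FT kernels.

```c
#include <complex.h>

#define NTOTAL 4   /* stands in for d1*d2*d3 (hypothetical, fixed here) */

/* intent(in) twiddle, intent(inout) u0, intent(out) u1 */
#pragma css task input(twiddle[NTOTAL]) inout(u0[NTOTAL]) output(u1[NTOTAL])
void evolve(const double complex *twiddle, double complex *u0,
            double complex *u1)
{
    for (int i = 0; i < NTOTAL; i++) {
        u0[i] *= twiddle[i];   /* u0 is read and updated: inout */
        u1[i] = u0[i];         /* u1 is only written: output    */
    }
}

/* intent(in) x, intent(out) xout; scheduled with high priority.
   Placeholder body: copies the data instead of computing an FFT. */
#pragma css task input(x[NTOTAL]) output(xout[NTOTAL]) highpriority
void cffts1(const double complex *x, double complex *xout)
{
    for (int i = 0; i < NTOTAL; i++)
        xout[i] = x[i];
}
```

Calling `evolve(twiddle, u0, u1)` followed by `cffts1(u1, x_help)` gives the runtime a producer/consumer pair on `u1`, which is the first edge of the task graph for each iteration.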
Task graph
[Figure: loop with one iteration per process. Each process P0, P1, ..., PN-1 runs the task chain evolve, cffts1, transp_local, transp_global, transp_finish, cffts2, cffts1, checksum; the processes are joined by an MPI_all_to_all at transp_global.]
Task graph
[Figure: several iterations. The task chain evolve, cffts1, transp_local, transp_global, transp_finish, cffts2, cffts1, checksum is repeated for successive iterations on process P0, with an MPI_all_to_all communication in each.]
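The chain in the task graph follows from the declared directions of the buffers. The sketch below is an illustration of that idea, not the SMPSs runtime algorithm: it lists the tasks of one FT iteration in program order, with fft(-1, u1, u2) expanded, and links each task to the last writer of every buffer it reads. Only true (read-after-write) dependencies are tracked; the type names, `ft_iteration`, `find_true_deps` and the `MAXT` limit are hypothetical.

```c
#include <string.h>

#define MAXT 16      /* maximum number of tasks tracked (hypothetical) */
#define MAXARGS 3

enum dir { IN, OUT, INOUT };
enum buf { U0, U1, U2, X_HELP, TWIDDLE, NBUF };

struct arg  { enum buf buf; enum dir dir; };
struct task { const char *name; int nargs; struct arg args[MAXARGS]; };

/* One iteration in program order (x1 = u1, x2 = u2).
   Directions follow the intent() declarations of the task interfaces. */
static const struct task ft_iteration[] = {
    { "evolve",            3, { {TWIDDLE, IN}, {U0, INOUT}, {U1, OUT} } },
    { "cffts1",            2, { {U1, IN}, {X_HELP, OUT} } },
    { "transpose2_local",  2, { {X_HELP, INOUT}, {U2, INOUT} } },
    { "transpose2_global", 2, { {U2, INOUT}, {X_HELP, INOUT} } },
    { "transpose2_finish", 2, { {X_HELP, INOUT}, {U2, INOUT} } },
    { "cffts2",            2, { {U2, IN}, {X_HELP, OUT} } },
    { "cffts1",            2, { {X_HELP, IN}, {U2, OUT} } },
    { "checksum",          1, { {U2, IN} } },
};

/* Sets dep[t][p] = 1 when task t reads a buffer last written by task p.
   Returns the number of distinct edges found. */
int find_true_deps(const struct task *tasks, int ntasks, int dep[MAXT][MAXT])
{
    int last_writer[NBUF];
    int edges = 0;

    memset(dep, 0, sizeof(int) * MAXT * MAXT);
    for (int b = 0; b < NBUF; b++)
        last_writer[b] = -1;                 /* no producer yet */

    for (int t = 0; t < ntasks && t < MAXT; t++) {
        /* Reads first: link to the producer of each input. */
        for (int a = 0; a < tasks[t].nargs; a++) {
            const struct arg *x = &tasks[t].args[a];
            int p = last_writer[x->buf];
            if (x->dir != OUT && p >= 0 && !dep[t][p]) {
                dep[t][p] = 1;
                edges++;
            }
        }
        /* Then writes: this task becomes the producer. */
        for (int a = 0; a < tasks[t].nargs; a++) {
            const struct arg *x = &tasks[t].args[a];
            if (x->dir != IN)
                last_writer[x->buf] = t;
        }
    }
    return edges;
}
```

For the eight tasks listed, the function finds seven edges, each from a task to its immediate predecessor, which matches the single chain drawn per process in the task graph.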
Performance analysis with Paraver
• Paraver traces are generated both for pure SMPSs and for MPI/SMPSs
• For pure SMPSs:
– Simply compile with the “-t” option
– Run as usual; the Paraver trace file is generated
• For MPI/SMPSs applications:
– Compile with the “-t” option
– Set the LD_PRELOAD variable:
   export LD_PRELOAD=$MPItrace_PATH/libsmpssmpitrace.so
• More details in the SMPSs manual
Performance analysis
• SMPSs comes with a set of Paraver configuration files
– $(SMPSs)/share/cellss/paraver_cfgs/
• The Paraver distribution also comes with its own set of configuration files, for example for MPI-specific analysis
• Some available configuration files:
– task.cfg: view of the tasks; each color is a different task type
– execution_phases.cfg: shows the activity of the threads, i.e., the percentage of time spent in each phase (adding a task, executing a task...)
– task_number.cfg: shows the id number of the tasks being executed
– flushing.cfg: shows the periods when the application is writing the tracefile, i.e., periods in which the tracefile is not significant
– Task_profile.cfg: classification of tasks versus threads
– 3dh_duration_tasks.cfg: task duration histogram; can be viewed by task type
Performance analysis
Conclusion
• StarSs is an active research project
• With a global view on multicore and parallel programming
• Stable enough to start porting relevant production codes at BSC
• With a lot of potential still to be investigated.