StarSs and CellSs: Data Flow Programming for Multicore Systems
Jesus Labarta, Computer Science Research Department Director, BSC

Source: sti.cc.gatech.edu/Slides2008/Labarta-080710.pdf · 2008-07-11


Page 1

StarSs and CellSs: Data Flow Programming for Multicore Systems

Jesus Labarta, Computer Science Research Department Director, BSC

Page 2

Back to Babel?

“Now the whole earth had one language and the same words” …

Book of Genesis

…”Come, let us make bricks, and burn them thoroughly. ”…

…"Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves”…

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

The computer age: Fortran & MPI, C++, Fortress, StarSs, OpenMP, MPI, X10, Sequoia, CUDA, Sisal, CAF, SDK, UPC, Cilk++, Chapel, HPF, ALF, RapidMind, …

Page 3

Key issues

Asynchronism, productivity, processor-memory bandwidth, variance, portability, % of memory used, load balance, malleability, memory association, scalability, programmability, maintainability, heterogeneous functionality and performance, hierarchy, dynamic environment, accepted by developers, abstraction.

Page 4

Key issues

(Same set of issues as on the previous page, with one addition highlighted: dazzled by performance.)

Page 5

Address spaces in multicores

Homogeneous in structure: Niagara, Nehalem, Power, … Heterogeneous: Cell (PPE + 8 SPEs), GPU, Roadrunner. Different cache hierarchies make this a really fuzzy space. How much of it should be visible to the programmer?

Page 6

The presentation

• Environment, philosophy, …
• CellSs
• Work in progress
• SMPSs
• MPI + SMPSs
• Summary

Page 7

Mare Incognito

An IBM-BSC joint research project towards a 10+ PF supercomputer.

We believe it is possible to build a "homogeneous" Cell-based system. We have "a vision" of the relevant technologies to develop, and of "cheap", "efficient", "not architecture specific", "wide applicability" machines based on them. Many of these technologies are not Cell specific and will also be evaluated/used for other architectures. We know it is risky.

Page 8

Project structure

• Performance analysis and prediction tools
• Processor and node
• Load balancing
• Interconnect
• Application development and tuning
• Fine-grain programming models
• Model and prototype

Page 9

Programming model for Mare Incognito?

• How much pressure can we put on our (BSC) program developers?
• Drastic changes?
• Hierarchy
  • MPI
  • OpenMP, StarSs, …
• "Beneficial" transformations:
  • Blocking
  • Better understanding of inputs and outputs
  • Understanding potential asynchronism
• Incremental

(Project-structure diagram: performance tools, processor and node, load balancing, interconnect, applications, programming models, models and prototype.)

Page 10

How my daughter sees the Cell

Page 11

How Mateo sees the Cell

(Cell/B.E. block diagram: a dual-threaded PPE with 32 KB I$ and 32 KB D$ L1 caches and a 512 KB L2; eight SPEs, each an SXU with a local store (LS) and an MFC, attached at 16 B/cycle each; the EIB with 4 x 16 B data rings at half the core clock frequency, up to 96 bytes/cycle; the MIC to dual XDR memory, up to 25.2 GB/s; the BIC to FlexIO, 6.4 GB/s per link, 76.8 GB/s total.)

Page 12

How I see the Cell

(PPE + 8 SPEs.) A thin, SMT processor. Simple hardware, aggregate performance, vector instructions: necessity turned into virtue. Hard to optimize: only vector instructions, alignment. Separate address spaces, explicit DMAs, tiny local memory. Many explicit memory transactions; bandwidth.

Page 13

A perspective on architectures and programming models

Mapping of concepts (processor ≅ Cell ≅ Grid):
  Instructions            →  block operations    →  full binary
  Functional units        →  SPUs                →  machines
  Fetch & decode unit     →  PPE                 →  home machine
  Registers (name space)  →  main memory         →  files
  Registers (storage)     →  SPU memory          →  files
  Granularity: ns         →  ~100 microseconds   →  minutes/hours

Stay sequential; just look at things from a bit further away. Architects do know how to run things in parallel.

Page 14

StarSs

• Programmability
  • Standard sequential look and feel (C, Fortran)
  • Incremental parallelization/restructuring
  • Abstract/separate algorithmic issues from resources
  • Methodology/practices:
    • Block algorithms: modularity
    • "No" side effects: local addressing
    • Promote visibility of "main" data
    • Explicit synchronization variables
• Portability
  • A runtime for each type of target platform
  • Matches computations to resources
  • Achieves "decent" performance, even on a sequential platform
  • Single source for the maintained version of an application

*Ss family: CellSs, SMPSs, GPUSs, GridSs

Page 15

CellSs

Page 16

CellSs concept: matrix multiply example

int main (int argc, char **argv) {
  int i, j, k;
  …
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

(Figure: the NB x NB matrix of BS x BS blocks.)

Page 17

CellSs concept: matrix multiply example

int main (int argc, char **argv) {
  int i, j, k;
  …
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

(Figure: the NB x NB matrix of BS x BS blocks.)

Dynamically defined instruction set. Modularity.

Page 18

CellSs compilation environment

(Toolchain diagram: app.c goes through the CSS compiler, producing app_spe.c and app_ppe.c. The SPE side is compiled (app_spe.o) and linked against llib_css-spe.so by the SPE linker into an SPE executable, which the SPE embedder turns into a PPE object. The PPE side is compiled (app_ppe.o) and, together with the embedded SPE object and llib_css-ppe.so, linked by the PPE linker, using the SDK, into the Cell executable.)

Page 19

CellSs execution environment

Page 20

CellSs: argument renaming

• Supports:
  • Decoupling data declaration/allocation from storage
  • Increasing the parallelism potential
  • Adapting to available memory
• When:
  • Eager: at task creation time
  • Lazy: at task execution time
• Where to dynamically allocate memory:
  • Main memory (GB)
  • LS: avoids transfers
• How much:
  • Limit the size of the memory used for renaming; stall when the limit is reached
  • Avoid swapping
  • Impact on locality
• Programmability:
  • Restrict/focus important data declarations

for (i=0; i<N; i++) {
  T1 (…, …, block1);
  T2 (block1, …, block2[i]);
  T3 (block2[i], …, …);
}

(Task-graph figure: each iteration's chain T1_i → T2_i → T3_i gets its own renamed instance of block1 (Block1.1 … Block1.N), so the chains of different iterations become independent.)

Page 21

CellSs Programming model: Dependence detection

(Renaming-table diagram: each address used by the program (@L) maps through a renaming table to a chain of object instances; every instance records its type and size, a pointer to the object storage (*obj), the producer task (*producer), the previous instance (*prev), and a count of pending users (#users). A sketch of such an entry follows.)

Instruction window: 40,000+ tasks! Speculation: is it needed?
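To make the diagram concrete, this is a minimal C sketch of what one renaming-table entry could hold; the fields mirror the slide, but the types and names are assumptions, not the actual CellSs data structures:

#include <stddef.h>

struct task;                                  /* producer-task bookkeeping (opaque here)       */

/* One live instance (renaming) of a task argument. Illustrative only. */
typedef struct object_instance {
    int    type;                              /* element type of the block                     */
    size_t size;                              /* size of the block in bytes                    */
    void  *obj;                               /* storage currently backing this instance       */
    struct task            *producer;         /* task that writes this instance                */
    struct object_instance *prev;             /* previous instance of the same user object     */
    int    users;                             /* pending consumers of this instance (#users)   */
} object_instance;

/* The renaming table maps an address appearing in the program (@L) to its newest instance. */
typedef struct renaming_entry {
    void            *user_address;
    object_instance *newest;
} renaming_entry;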

Page 22

Memory

• Object cache
  • Reuse the SPE Local Store
  • Avoid DMA transfers
• Double buffering (a generic sketch of the pattern follows)
  • Automatic: before executing a task, launch
    • the DMA-in of the arguments of the next task
    • the DMA-out of the results of the previous one
  • Dependent on object sizes (LS pressure)
  • Handle alignment issues
• Locality: knowing the data accesses, can the scheduler
  • maximize the reuse of the LS?
  • alleviate the off-chip bandwidth bottleneck?
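A generic, self-contained sketch of the double-buffering pattern the runtime automates. Plain C, with memcpy standing in for the asynchronous DMA get/put operations (which, unlike memcpy, overlap with the compute step); buffer sizes and the compute kernel are illustrative, not CellSs internals:

#include <string.h>

#define NTASKS 16
#define BS     64

static float main_mem[NTASKS][BS];    /* "main memory" copies of the operands     */
static float ls_buf[2][BS];           /* two local-store sized buffers            */

static void compute(float *block)     /* stand-in for a task body                 */
{
    for (int i = 0; i < BS; i++)
        block[i] += 1.0f;
}

int main(void)
{
    memcpy(ls_buf[0], main_mem[0], sizeof ls_buf[0]);        /* bring in the first block   */
    for (int t = 0; t < NTASKS; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < NTASKS)                                   /* prefetch the next operands */
            memcpy(ls_buf[nxt], main_mem[t + 1], sizeof ls_buf[nxt]);
        compute(ls_buf[cur]);                                 /* work on the current buffer */
        memcpy(main_mem[t], ls_buf[cur], sizeof ls_buf[cur]); /* write back the result      */
    }
    return 0;
}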

Page 23

CellSs: continuous runtime tuning

• Master overhead
  • ~3 µs per task; something to control and minimize (see the rough estimate below)
• Dependence handling
  • Storage capacity/addressing ↔ task granularity
• How will it evolve?
  • Future Cell
  • Other platforms (Roadrunner)
• Distributed implementation? Offload?
• Multiple *Ss engines
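A rough back-of-the-envelope, assuming the ~3 µs figure above: a single master generating tasks at ~3 µs each can issue at most about 330,000 tasks per second, so keeping 8 SPEs busy requires an average task duration of at least roughly 8 × 3 µs = 24 µs; with much finer tasks the master becomes the bottleneck, which is why this overhead is under continuous tuning.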

(Plots: matrix-multiply scalability, speed-up vs. number of SPUs (1-8), measured in April, July and November 2007, with execution times improving as the runtime was tuned; and Cholesky performance, GFlops vs. matrix size from 1024 to 4096.)

Page 24

Flexible/dynamic model: SparseLU example

int main(int argc, char **argv) {
  int ii, jj, kk;
  …
  for (kk=0; kk<NB; kk++) {
    lu0(A[kk][kk]);
    for (jj=kk+1; jj<NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii=kk+1; ii<NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv (A[kk][kk], A[ii][kk]);
        for (jj=kk+1; jj<NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
}

void lu0(float *diag);
void fwd(float *diag, float *col);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);

(Figure: the NB x NB matrix of B x B blocks.)

Page 25

Flexible/dynamic model: SparseLU example (continued)

Dynamic main memory allocation; data-dependent parallelism. The main program is the same as on the previous page; the tasks are annotated as follows:

#pragma css task inout(diag[B][B]) highpriority
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

(Plot: SparseLU speed-up vs. number of SPUs, 1-8.)

Page 26

Distant/interprocedural parallelism: Test LU example

• Ideally: unrolling of the whole program

int main(int argc, char* argv[])
{
  long t_start, t_end;
  double time;

  genmats(L, U);
  clean_mat(A);
  sparse_matmult(L, U, A);
  copy_mat(A, origA);

  t_start = usecs();
  LU(A);
  t_end = usecs();
  time = ((double)(t_end - t_start)) / 1000000;

  split_mat(A, L, U);
  clean_mat(A);
  sparse_matmult(L, U, A);
  compare_mat(origA, A, &are_equal);

  #pragma css wait on (&are_equal)
  if (are_equal) printf("Great!\nAnd only took %f\n", time);
}

Page 27

Work in Progress

Page 28

N-dimensional data structures: syntax

#pragma css task \
    input(parameter_list)opt output(parameter_list)opt inout(parameter_list)opt highpriorityopt
function_definition or function_declaration

parameter:
    name dimension_specifier_listopt range_specifier_listopt

range_specifier:
    {}
    {index}
    {begin..end}
    {begin:length}

dimension_specifier:
    [length]

(Figure: for a two-dimensional array A, A or A{}{} denotes the whole array, A{i..j}{} rows i..j, A{}{k..m} columns k..m, and A{i..j}{k..m} the corresponding sub-block.)

Page 29

N-dimensional data structures

• How to specify subsections of array structures

#pragma css task input( a{i:BS}{k:BS}, b{k:BS}{j:BS}, i, j, k ) \
                 inout( c{i:BS}{j:BS} )
void matmul(float a[N][N], float b[N][N], float c[N][N], int i, int j, int k);

int main(...) {
  ...
  for (i=0; i<N; i+=BS)
    for (j=0; j<N; j+=BS)
      for (k=0; k<N; k+=BS)
        matmul(a, b, c, i, j, k);
  ...
}

Page 30

N-dimensional data structures

• How to specify subsections of array structures

#pragma css task input(transa, transb, m, n, k, alpha, lda, ldb, beta, ldc, \
                       a[*k][*lda] {}{0:*m}, \
                       b[*n][*ldb] {}{0:*k}) \
                 inout(c[*n][*ldc] {}{0:*m})
void dgemm(char *transa, char *transb,
           integer *m, integer *n, integer *k, double *alpha,
           double *a, integer *lda,
           double *b, integer *ldb,
           double *beta, double *c, integer *ldc);

Page 31

Global memory addressing

• Local addressing
  • Enables automatic double buffering: large, non-speculative prefetching
  • Task granularity
    • Depends on operand sizes
    • Limited by local-store capacity
    • Limits scalability for a given task-handling cost
• Global addressing decouples granularity from storage capacity
  • Manual:
    • Already possible (mfc_read, …)
    • Escapes the dependence-control mechanism; user responsibility
  • Automatic:
    • Requires a caching mechanism
  • Can the two be cleanly combined?
• Separation of dependence handling and data transfer

Page 32

Nested StarSs

• Hierarchical task graphs
• Concurrent generation into a flat task graph

Model: OpenMP + StarSs? Opportunity for transactional memory within the runtime.

Page 33

Reductions

• Syntax similar to OpenMP reductions
• Private copy initialized to the operator's neutral value
• Local update during the task
• Atomic update of the global value
• Dependence semantics
  • Commutative
  • Count of the number of users

To appear in the next distribution. (A hypothetical sketch of the intended usage follows.)
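Since the slides do not show the final syntax, this is only a hypothetical sketch of a reduction task following the semantics above; the reduction(...) clause and its spelling are assumptions, not the released CellSs/SMPSs syntax:

#define BS 64

/* Hypothetical reduction clause; illustrative only. */
#pragma css task input(v{0:BS}) reduction(+ : *sum)
void acc_block(float v[BS], double *sum)
{
    double local = 0.0;              /* private copy, initialized to the neutral value of "+"        */
    for (int k = 0; k < BS; k++)
        local += v[k];               /* local update while the task runs                             */
    *sum += local;                   /* the runtime would apply this update to the global atomically */
}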

Page 34

Memory association

• Data layout
  • Main-memory and LS layouts of data blocks: necessarily different because of the LS size
  • If we have to address the issue anyway, can we exploit it further?
  • ≈ work arrays
• Potential to handle it automatically
  • Avoid specific user declarations and code: abstraction
  • Several instances of an object, with different layouts
  • Dynamically managed by the runtime (cache)
• Code restructuring to change access patterns: the key challenge

Page 35

SMPSs

Page 36

SMPSs

• "Same" source code
  • Although with more flexibility (block size, …)
• Same compiler
  • Although not forced to address the memory-association issue
• SMP runtime
  • Shared-memory implementation

(Plot: LU performance, GFlops vs. number of threads (1-8), for matrix sizes 1024, 2048 and 4096, against the machine peak, on a 2-way POWER5.)

Page 37

Asynchronism: I/O

• Trace handling and visualization tools

• Statistics

• Filters

#pragma css task inout(fd) output(buffer, R_recs, end_trace) highpriority
void Read (FILE *fd, char buffer[500][4096], int *R_recs, int *end_trace);

#pragma css task input(buffer_in, R_recs, control) output(buffer_out, W_recs)
void Process (char buffer_in[500][4096], int R_recs, proc_ctrl control,
              char buffer_out[500][4096], int *W_recs);

#pragma css task input(buffer, W_recs) inout(fd, total_records)
void Write (FILE *fd, char buffer[500][4096], int W_recs,
            unsigned long long *total_records);

int main()
{ …
  while (!end_trace) {
    Read (infile, buffer_in, &RR, &end_trace);
    Process (buffer_in, RR, control, buffer_out, &RW);
    Write (outfile, buffer_out, RW, &total_records);
    #pragma css wait on (&end_trace)
  }
}

Page 38

Asynchronism: I/O

• Asynchronism
  • Decoupling/overlapping I/O and processing
• Serialization of I/O
• Resource constraints
  • Requests for a specific thread, …
• Task-duration variance
  • Dynamic scheduling

(Histogram: duration of the "read record" task, number of occurrences vs. time, showing significant variance.)

Page 39

SMPSs vs. OpenMP 3.0 vs. Cilk

Benchmarks: multisort, N-queens, sparse LU.

• Benchmarks used for OpenMP 3.0 development
• Similar performance in some ranges
• Overlap potential in SMPSs
• Programmability issues: reductions, memory allocation, synchronization representatives, nesting, … (an OpenMP 3.0 sketch of one benchmark follows for comparison)
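For comparison, a minimal sketch of how the sparse-LU updates can be written with OpenMP 3.0 tasks. Because OpenMP 3.0 tasks carry no dependence information, the ordering has to be enforced with explicit taskwait barriers between phases, whereas SMPSs derives it from the input/inout annotations. The kernels and data are the ones from the SparseLU example earlier; the enclosing declarations are assumed:

#pragma omp parallel
#pragma omp single
for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk + 1; jj < NB; jj++)
        if (A[kk][jj] != NULL) {
            #pragma omp task firstprivate(kk, jj)
            fwd(A[kk][kk], A[kk][jj]);
        }
    for (ii = kk + 1; ii < NB; ii++)
        if (A[ii][kk] != NULL) {
            #pragma omp task firstprivate(kk, ii)
            bdiv(A[kk][kk], A[ii][kk]);
        }
    #pragma omp taskwait   /* no task dependences in 3.0: wait for all fwd/bdiv of this step */
    for (ii = kk + 1; ii < NB; ii++)
        for (jj = kk + 1; jj < NB; jj++)
            if (A[ii][kk] != NULL && A[kk][jj] != NULL) {
                if (A[ii][jj] == NULL) A[ii][jj] = allocate_clean_block();
                #pragma omp task firstprivate(kk, ii, jj)
                bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
            }
    #pragma omp taskwait   /* finish step kk before factoring the next diagonal block */
}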

Page 40

MPI + SMPSs

Page 41

MPI + SMPSs

• Linpack
  • Simplifies the program
    • Same source code for any level of lookahead (0-n); no specific code needed inside each computation routine
  • Decouples/overlaps computation and synchronization
    • Tolerates slow networks
    • Tolerance to OS noise
    • The programmer can concentrate on other issues (load balance, …)
  • Smaller problem size needed to reach maximum performance
    • Memory is used, if necessary, by the runtime (within the limited renaming space)

Linpack distribution: include directory, 19 files, 1,946 code lines; src directory, 113 files, 11,945 code lines; parameters: P, Q, broadcast, lookahead.

Page 42

MPI + SMPSs

• Extend asynchronism to the outer level

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);

#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE])
void send(float *A);

#pragma css task output(A[SIZE])
void receive(float *A);

#pragma css task input(A[SIZE])
void resend(float *A);

…
for (k=0; k<N; k++) {
  if (mine) {
    Factor_panel(A[k]);
    send(A[k]);
  } else {
    receive(A[k]);
    if (necessary) resend(A[k]);
  }
  for (j=k+1; j<N; j++)
    update(A[k], A[j]);
}

Page 43

MPI + SMPSs

• Overlap communication and computation
  • Asynchronous/immediate MPI calls + wait tasks
  • Restartable tasks
    • Avoid deadlock
    • Avoid inefficient use of resources (busy waiting)

#pragma css task input(A[SIZE]) output(send_req)
void Isend(float *A, int *send_req)
{
  MPI_Isend (A, …);
}

#pragma css task input(send_req) inout(A[SIZE])
void Wait_Isend(int *send_req, float *A)
{
  int ierr, go;
  ierr = MPI_Test(send_req, &go, …);
  if (go == 0)
    #pragma css restart
  ierr = MPI_Wait(send_req, …);
}

#pragma css task output(recv_req, A[SIZE])
void Ireceive(int *recv_req, float *A)
{
  MPI_Irecv(A, …);
}

#pragma css task input(recv_req) inout(A[SIZE])
void Wait_Ireceive(int *recv_req, float *A)
{
  int ierr, go;
  ierr = MPI_Test(recv_req, &go, …);
  if (go == 0)
    #pragma css restart
  ierr = MPI_Wait(recv_req, …);
}

Page 44

MPI + SMPSs

Page 45

MPI + SMPSs

Page 46

Summary

Page 47

StarSs

• Parallelism
  • Automatic run-time parallelism detection and exploitation
  • Unstructured (DAG)
  • Data-dependent parallelism
  • Exploits distant, interprocedural parallelism
  • Asynchronous, out-of-order, dynamic data-flow execution
  • Tolerates variance
  • Extends to outer programming models/issues: I/O, MPI + StarSs
  • Malleable
  • Incremental code parallelization
• Memory
  • Single global address space
  • Renaming:
    • Decouples data declaration/allocation from storage (available size and structure)
    • Increases parallelism potential
    • Adapts to available memory
  • Potential for automatic handling of memory-association issues
  • Memory wall: latency tolerance
    • Automatic double buffering
    • DMA transfers ~ vector loads
  • Automatic locality optimization

Page 48

StarSs

• Programming practices
  • Make the relevant data structures visible
  • Explicit synchronization variables
  • Does require some effort, but is globally beneficial
  • Clean, structured code

Page 49

Convergence? (Hierarchical) Integration?

StarSs: implicit parallelism, "atomic" tasks, single work generator, local addressing.

OpenMP: explicit parallelism, nested tasks, global addressing.

TM: mechanisms guaranteeing atomicity, retry mechanisms, speculation, detection of dependences, …

OpenMP 3.X?

Page 50

StarSs: issues and ongoing efforts

• CellSs programming model
  • Memory association
  • Array regions
  • Sub-object accesses
  • Access to global memory by tasks? Blocks larger than the Local Store
  • Inline directives
  • Reductions
  • Separation of synchronization and data transfer
  • Resource constraints
  • Nesting: hierarchical task graph vs. flat task graph
• Porting
  • P6 (dcbc)
  • BG
• Runtime
  • Further optimization of overheads; architectural support
  • Minimal set of synchronization arcs
  • Scheduling algorithms: overhead, locality
  • Overlays
  • Short-circuiting (SPE-to-SPE transfers)
• Debugging
  • Amenable to race-detection tools
• Compiler
  • Dusty deck → StarSs
• Applications
  • Porting from sequential
  • Porting from parallel
  • New algorithms

http://www.bsc.es/cellsuperscalar
http://www.bsc.es/smpsuperscalar

Page 51

The importance of …

• Language
  • Transports / integrates experience
  • A good "predictor"
  • Structures the mind
  • Abstraction
• Details
  • Very "simple/small/subtle" differences → very important impact

(Figure: Neanderthal, high larynx vs. H. sapiens, low larynx.)

Page 52

The importance of …

(Same slide as the previous page, with the closing wish added:)

Wish we were able to develop a language that is communicated to humans, expresses ideas, and is exploitable by machines.