
HUST

StreamX10: A Stream Programming Framework on X10

Haitao Wei, 2012-06-14

School of Computer Science at Huazhong University of Science and Technology

2

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

Background and Motivation

Stream Programming
A high-level programming model that has been productively applied
Usually depends on the specific architecture, which makes it difficult to port between different platforms

X10
A productive parallel programming environment that isolates the architecture details and provides a flexible parallel programming abstraction layer for stream programming

StreamX10: tries to make stream programs portable, based on X10

4

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

COStream Language

5

stream
A FIFO queue connecting operators

operator
The basic functional unit, an actor node in the stream graph
Multiple inputs and multiple outputs
Window-based pop, peek, and push operations
Init and work functions

composite
Connected operators, a subgraph of actors
A stream program is composed of composites

COStream and Stream Graph

6

composite Main {
  graph
    stream<int i> S = Source() {
      state: { int x; }
      init: { x = 0; }
      work: {
        S[0].i = x;
        x++;
      }
      window S: tumbling, count(1);
    }
    stream<int j> P = MyOp(S) {
      param pn: N
    }
    () as SinkOp = Sink(P) {
      state: { int r; }
      work: {
        r = P[0].j;
        println(r);
      }
      window P: tumbling, count(1);
    }
}

composite MyOp(output Out; input In) {
  param attribute: pn
  graph
    stream<int j> Out = Averager(In) {
      work: {
        int sum = 0, i;
        for (i = 0; i < pn; i++)
          sum += In[i].j;
        Out[0].j = sum / pn;
      }
      window In: sliding, count(10), count(1);
             Out: tumbling, count(1);
    }
}

[Stream graph: Source → S → Averager → P → Sink, with the keywords stream, operator, and composite labeling the corresponding graph elements; Source: push=1; Averager: peek=10, pop=1, push=1; Sink: pop=1]
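To make the window semantics concrete, here is a minimal Python sketch (not COStream; all names are illustrative) of the Source, Averager, and Sink above, modeling push, peek, and pop on a sliding window:

```python
from collections import deque

def source(n):
    """Source: pushes one integer per firing (push=1)."""
    for x in range(n):
        yield x

def averager(stream, window=10, pop=1):
    """Averager: peeks a window of `window` items, pops `pop`, and
    pushes one average per firing (peek=10, pop=1, push=1)."""
    buf = deque()
    for item in stream:
        buf.append(item)
        if len(buf) >= window:          # enough tokens to fire
            yield sum(buf) // window    # integer average, as in the COStream code
            for _ in range(pop):        # slide the window by `pop`
                buf.popleft()

def sink(stream):
    """Sink: pops one item per firing and collects it (pop=1)."""
    return list(stream)

out = sink(averager(source(12)))
# averages of the windows [0..9], [1..10], [2..11] → [4, 5, 6]
```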

7

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

Compilation flow of StreamX10

Phase: Function

Front-end: translates the COStream syntax into an abstract syntax tree.
Instantiation: instantiates the composites hierarchically into static flattened operators.
Static Stream Graph: constructs the static stream graph from the flattened operators.
Scheduling: calculates the initialization and steady-state execution orderings of the operators.
Partitioning: performs partitioning based on the X10 parallelism model for load balance.
Code Generation: generates X10 code for COStream programs.
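The Scheduling phase can be read as solving the synchronous-dataflow balance equations for operator repetition counts. A minimal Python sketch, assuming a simple chain of operators with fixed push/pop rates (function and operator names are illustrative):

```python
from fractions import Fraction
from math import lcm

def steady_state_counts(edges):
    """Solve the balance equations reps[prod] * push == reps[cons] * pop
    for a chain of operators given as (producer, consumer, push, pop)
    tuples; returns the smallest integer repetition vector for one
    steady-state iteration."""
    reps = {}
    for prod, cons, push, pop in edges:
        if prod not in reps:
            reps[prod] = Fraction(1)     # anchor the first operator at rate 1
        reps[cons] = reps[prod] * push / pop
    # scale the rational vector to the smallest all-integer solution
    denom = lcm(*(r.denominator for r in reps.values()))
    return {op: int(r * denom) for op, r in reps.items()}

# Source pushes 1/firing; Averager pops 1, pushes 1; Sink pops 1
counts = steady_state_counts([
    ("Source", "Averager", 1, 1),
    ("Averager", "Sink", 1, 1),
])
# → {"Source": 1, "Averager": 1, "Sink": 1}
```

With mismatched rates the vector grows accordingly: an edge pushing 2 tokens into a consumer popping 3 yields repetitions {producer: 3, consumer: 2}.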

The Execution Framework

9

[Figure: Places 0, 1, and 2, each running an activity from its thread pool; local buffer objects carry intra-place data flow and a global buffer object carries inter-place data flow]

The nodes are partitioned among the places
Each node is mapped to an activity
The nodes execute in a pipeline fashion to exploit parallelism
Local and global FIFO buffers are used for intra- and inter-place communication
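The pipeline execution style can be sketched in Python with one thread per node (one "activity") connected by FIFO queues; this illustrates the idea only and is not the X10 runtime:

```python
import threading, queue

def run_pipeline(stages, inputs):
    """Run each stage in its own thread, connected by FIFO queues,
    mimicking pipeline parallelism across activities."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    DONE = object()   # sentinel to shut the pipeline down

    def worker(stage, qin, qout):
        while True:
            item = qin.get()
            if item is DONE:
                qout.put(DONE)
                return
            qout.put(stage(item))

    threads = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for x in inputs:
        qs[0].put(x)
    qs[0].put(DONE)
    out = []
    while True:
        item = qs[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out

result = run_pipeline([lambda x: x + 1, lambda x: x * 2], range(3))
# → [2, 4, 6]
```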

Work Partition Inter-place

10

Objective: minimize communication and balance the load (using METIS)

10

2 2 2

2

10

2

1

55

55

5

5

1

Comp. work=10

Comp. work=10

Comp. work=10

Speedup: 30/10 =3Communication: 2
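A hypothetical Python helper makes the partitioning objective concrete: given node work, edge weights, and a placement, it reports the per-place load, the inter-place communication cut, and the resulting speedup (all names and the example graph are illustrative):

```python
def partition_metrics(node_work, edges, assignment):
    """Evaluate a placement: per-place computational work and the
    communication cost, i.e. the total weight of edges whose
    endpoints land in different places."""
    load = {}
    for node, work in node_work.items():
        place = assignment[node]
        load[place] = load.get(place, 0) + work
    cut = sum(w for u, v, w in edges if assignment[u] != assignment[v])
    total = sum(node_work.values())
    speedup = total / max(load.values())   # bounded by the busiest place
    return load, cut, speedup

# Illustrative graph: total work 30 split evenly over 3 places,
# with 2 units of traffic crossing place boundaries
node_work = {"a": 10, "b": 10, "c": 10}
edges = [("a", "b", 1), ("b", "c", 1)]
load, cut, speedup = partition_metrics(node_work, edges,
                                       {"a": 0, "b": 1, "c": 2})
# load = {0: 10, 1: 10, 2: 10}, cut = 2, speedup = 3.0
```

A partitioner such as METIS searches for the assignment minimizing the cut subject to balanced loads; this helper only scores a given assignment.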

Global FIFO implementation

11

[Figure: producer and consumer, each with a local array of n slots, connected through a DistArray spanning Place0 and Place1; push fills the producer's local buffer, peek/pop drain the consumer's, and bulk copies move data through the DistArray]

Each producer/consumer has its own local buffer
The producer uses the push operation to store data into its local buffer
The consumer uses the peek/pop operations to fetch data from its local buffer
When the local buffer is full (producer) or empty (consumer), the data is copied automatically
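The copy-on-full/copy-on-empty scheme can be sketched in Python, with a plain list standing in for the X10 DistArray (the class and its buffer sizes are illustrative):

```python
class BufferedFIFO:
    """A FIFO split into a producer-side local buffer, a shared
    'global' array, and a consumer-side local buffer; data moves
    between local and global storage only when a local buffer
    fills up or runs dry, so elements cross places in bulk."""
    def __init__(self, local_size=4):
        self.local_size = local_size
        self.prod_buf = []      # producer's local buffer
        self.global_buf = []    # stands in for the X10 DistArray
        self.cons_buf = []      # consumer's local buffer
        self.copies = 0         # count of bulk copy operations

    def push(self, item):
        self.prod_buf.append(item)
        if len(self.prod_buf) == self.local_size:   # local buffer full
            self.global_buf.extend(self.prod_buf)   # one bulk copy
            self.prod_buf.clear()
            self.copies += 1

    def pop(self):
        if not self.cons_buf:                       # local buffer empty
            take = self.global_buf[:self.local_size]
            del self.global_buf[:self.local_size]   # one bulk copy
            self.cons_buf.extend(take)
            self.copies += 1
        return self.cons_buf.pop(0)

fifo = BufferedFIFO(local_size=4)
for i in range(8):
    fifo.push(i)
out = [fifo.pop() for _ in range(8)]
# out == [0, 1, ..., 7]; 4 bulk copies instead of 8 per-element transfers
```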

X10 code in the Back-end

12

Spawn an activity for each node at its place, according to the partition

Call the work function following the initial and steady schedules

Define the work function

// main.x10 control code
public static def main() {
  ...
  finish for (p in Place.places()) async at (p) {
    switch (p.id) {
      case 0:
        val a_0 = new Source_0(rc);
        a_0.run();
        break;
      case 1:
        val a_2 = new MovingAver_2(rc);
        a_2.run();
        break;
      case 2:
        val a_1 = new Sink_1(rc);
        a_1.run();
        break;
      default:
        break;
    }
  }
  ...
}

// Source.x10 code
...
def work() {
  ...
  push_Source_0_Sink_1(0).x = x;
  x += 1.0;
  pushTokens();
  popTokens();
}

public def run() {
  initWork(); // init
  // initSchedule
  for (var j:Int = 0; j < Source_0_init; j++)
    work();
  // steadySchedule
  for (var i:Int = 0; i < RepeatCount; i++)
    for (var j:Int = 0; j < Source_0_steady; j++)
      work();
  flush();
}
...

13

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

Experimental Platform and Benchmarks

14

Platform
Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
Red Hat EL5 with Linux 2.6.18
X10 compiler and runtime version 2.2.0

Benchmarks
11 benchmarks rewritten from StreamIt

Throughput Comparison

15

Throughput of 4 different configurations (NPLACES × NTHREADS = 8), normalized to 1 place with 8 threads

[Bar chart: per-benchmark throughput normalized to 1 place with 8 threads, for the configurations NPLACES=1/NTHREADS=8, NPLACES=2/NTHREADS=4, NPLACES=4/NTHREADS=2, and NPLACES=8/NTHREADS=1]

• For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for benchmarks with a low computation/communication ratio

• The benefits are small or negative when the number of places increases from 4 to 8

Observation and Analysis

16

The throughput goes up as the number of places increases, because multiple places increase the CPU utilization

Multiple places expose parallelism but also bring more communication overhead

Benchmarks with a larger computation workload, like DES and Serpent_full, can still benefit from an increasing number of places

17

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

Conclusion

We proposed and implemented StreamX10, a stream programming language and compilation system on X10

An initial partitioning optimization is proposed to exploit parallelism based on the X10 execution model

Preliminary experiments are conducted to study the performance

18

Future Work

How to choose the best configuration (number of places and number of threads) automatically for each benchmark

How to decrease the thread-switching overhead by mapping multiple nodes to a single activity

19

Acknowledgment

X10 Innovation Award funding support

QiMing Teng, Haibo Lin, and David P. Grove at IBM for their help with this research

20

HUST