
Programming the Energy Efficiency of High Performance Computing Systems

Nikolopoulos, D. S. (2013). Programming the Energy Efficiency of High Performance Computing Systems. Abstract from Fourth International Conference on Energy-Aware High Performance Computing, Dresden, Germany. https://verc.enes.org/community/announcements/events/ena-hpc-2013-fourth-international-conference-on-energy-aware-high-performance-computing

Document Version: Early version, also known as pre-print

Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal

Publisher rights: © 2013 The Author

General rights: Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and/or other copyright owners, and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy: The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact [email protected].

Download date: 01 May 2022


Programming the Energy-Efficiency of High-Performance Computing Systems

Professor Dimitrios S. Nikolopoulos, HPDC Research Cluster, Queen's University of Belfast


Points to get across

• Waste-free HPC software is instrumental in the battle against power

• Scale-freedom of HPC is a path to energy-efficiency

• Energy challenges will remind us of the Hydra Lernaia

 


Energy in HPC

[Figure: number of papers tagged "Power/Energy/HPC" in the ACM Digital Library per year, 2001-2012 (scale 0 to 300)]


HPC has led the way (or not?)

• The history of BlueGene
  – Based on processors from the embedded-systems market (PowerPC)
  – Pioneered the "scale-out" idea, now common in datacentres

• Many nodes with simple cores, fast interconnect
  – Dominated the Top-500 and Green-500 lists

• Embedded processors are now commodity components
  – Able to power competing supercomputers (e.g. BSC Mont-Blanc)


Leader or laggard?


Leader or laggard?

• Is HPC reusing or discovering?
  – Processors originally designed for the mobile-phone market
  – Clock gating, DVFS and device sleep states have been well known for 20 years


What can HPC contribute towards zero-power computing?


The challenge and the opportunity

• Assume that the currently most energy-efficient supercomputer sustains its improvement towards an Exaflop
  – It will need a 2384× increase in performance, at 202.7 MW

• Assume a target power cap of 25 MW
  – We need two orders of magnitude improvement in FLOPS/W
  – Can hardware achieve this improvement without compromising the power target?
  – If systems have any hope of achieving this, they must eliminate waste
  – The actual power cap may be lower than the nominal power consumption
  – Opportunities for software!
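For reference, the arithmetic behind the 25 MW target, using only the numbers on this slide:

$10^{18}\ \mathrm{FLOP/s} \;/\; (25 \times 10^{6}\ \mathrm{W}) \;=\; 4 \times 10^{10}\ \mathrm{FLOP/s\ per\ W} \;=\; 40\ \mathrm{GFLOPS/W}$ sustained.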


Where can HPC make the difference?

• HPC prioritises efficiency in programming
  – Minimise communication
  – Balance the load
  – Utilise available cores
  – Reduce cache and memory footprint

• HPC has been leading the way in parallel programming technology
  – Parallel languages, compilers, runtime systems

• Waste-free parallel programming is energy-efficient
  – Opportunity to reduce power consumption
  – Opportunity to do more within a power budget
  – With suitable language and runtime support


What can parallel programming do for energy-efficiency?


If parallel programs were scale-free

• Power increases linearly with the number of active cores
  – Previously dynamic power, but now also static power

• Program speedup curves have knees
  – Caused by synchronisation, dependencies, or the algorithm itself
  – Energy-efficient programs would execute at the beginning of the knee
  – How do we locate the knee?

[Figure: speedup vs. cores; the flat region beyond the knee of each speedup curve is a concurrency-throttling (CT) opportunity]


Exploring the concurrency-power trade-off

• Programs execute distinct phases
  – Programmer-annotated, or auto-mined from time series of hardware performance monitors (HPMs)
  – Compute-, memory- or communication-bound

• Dynamic scalability predictors
  – Concurrency sweet-spot detection with empirical modeling [ICS06] (see the sketch below)
  – Rigorous statistical modeling [TPDS08]
  – Machine learning approaches [EuroPar10]
  – MPI task aggregation [IPDPS10]
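To make the idea of locating the knee concrete, here is a deliberately simple sketch in C. It is not the [ICS06], [TPDS08] or [EuroPar10] predictor, just an illustration: sample the speedup at a few thread counts and stop scaling once the marginal speedup per added thread drops below a threshold. The thread counts, speedups and threshold are hypothetical.

    #include <stdio.h>

    /* Pick the smallest thread count whose marginal speedup per added thread
     * stays above min_gain; beyond that point, extra cores mostly burn power. */
    static int knee_threads(const int *threads, const double *speedup, int n,
                            double min_gain)
    {
        int best = threads[0];
        for (int i = 1; i < n; i++) {
            double gain = (speedup[i] - speedup[i - 1]) /
                          (double)(threads[i] - threads[i - 1]);
            if (gain < min_gain)
                break;                     /* knee reached */
            best = threads[i];
        }
        return best;
    }

    int main(void)
    {
        int threads[]    = { 1, 2, 4, 8, 16, 32 };            /* hypothetical */
        double speedup[] = { 1.0, 1.9, 3.6, 6.1, 7.0, 7.2 };  /* hypothetical */
        printf("run with %d threads\n", knee_threads(threads, speedup, 6, 0.25));
        return 0;
    }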


Scale-freedom improves energy-efficiency

Scale-free parallel programs can control their power budget.
Scale-free parallel programs adapt to power caps.


Is controlling concurrency enough? Enter the task-mapping problem


Beyond scale freedom

Multi-objective optimisation [PACT08]:
  – adapt concurrency
  – control other power knobs at a fine granularity (per task, within microseconds, …)
  – applicable to DCT, DVFS, thread mapping, data placement, … (illustrated by the sketch below)
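As a hedged illustration of the kind of per-task decision involved (this is not the [PACT08] multi-objective algorithm), the fragment below picks, for one task type, the lowest-energy P-state whose time per task still meets a deadline. The profile numbers are invented; in practice they would come from measurements like those plotted below.

    #include <stdio.h>
    #include <stddef.h>

    /* Per-P-state measurements for one task type (assumed inputs). */
    struct pstate_profile {
        double freq_ghz;        /* e.g. 1.6 ... 3.6                      */
        double energy_per_task; /* Joules per task at this P-state       */
        double time_per_task;   /* microseconds per task at this P-state */
    };

    /* Return the index of the P-state that minimises energy per task while
     * keeping the time per task within budget; fall back to the fastest
     * P-state if none fits. */
    size_t pick_pstate(const struct pstate_profile *p, size_t n, double budget_us)
    {
        size_t best = 0, fastest = 0;
        int found = 0;
        for (size_t i = 0; i < n; i++) {
            if (p[i].time_per_task < p[fastest].time_per_task)
                fastest = i;
            if (p[i].time_per_task <= budget_us &&
                (!found || p[i].energy_per_task < p[best].energy_per_task)) {
                best = i;
                found = 1;
            }
        }
        return found ? best : fastest;
    }

    int main(void)
    {
        struct pstate_profile prof[] = {   /* hypothetical numbers */
            { 1.6, 120.0, 260.0 },
            { 2.4, 105.0, 190.0 },
            { 3.6, 140.0, 130.0 },
        };
        size_t k = pick_pstate(prof, 3, 200.0);
        printf("chosen P-state: %.1f GHz\n", prof[k].freq_ghz);
        return 0;
    }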

[Figures: processor DVFS impact on the average energy per task (Joules) and the execution time per task (microseconds) for Multisort and Blackscholes, across P-states from 1.6 to 3.6 GHz]


Taming locality issues

The original DCT work failed to capture the implications of thread migration.

Example: up to 45% execution-time variation across 85 mappings.

4 four-core nodes: 43,680 mappings. 16 four-core nodes: 63 million mappings. 1,000 four-core nodes: on the order of 10^43 mappings.


DyNUMA training using an ANN

Optimize for concurrency, vertical and horizontal locality [IISWC12, SIGMETRICS PER]


Modeling accuracy

[Figure: predicted wall-clock time, measured wall-clock time (seconds) and normalized prediction (%) across test configurations]


Performance on TilePro64

• The TilePro64's default Linux OS mapping is inefficient
• More concurrency does not necessarily improve performance
• Counter-intuitive mappings optimise energy-efficiency

[Figure: wall-clock time and energy-delay product (EDP), normalized (%), for NPB.FT.1, AMG.Relax and AMG.Matvec]


Is controlling concurrency, mapping and waste at one program level enough?


Energy-Aware Hybrid Programming

Slack dispersion algorithms [IPDPS10, TPDS13]

[Figure: per-phase schedules of tasks i, j and k (phases numbered 1-3), used to illustrate slack dispersion]


Critical-path-based modeling

Predicting time vs. predicting the scaling function:

$t_i = \min_{1 \le thr \le X \cdot Y} \sum_{j=1}^{M} t_{i,j,thr}$

$t_c = \max_{1 \le i \le N} \; \min_{1 \le thr \le X \cdot Y} \sum_{j=1}^{M} t_{i,j,thr}$

where $t_{i,j,thr}$ is the time of phase $j$ of task $i$ when executed with $thr$ threads, $M$ is the number of phases per task, $N$ the number of tasks, and $X \cdot Y$ the maximum available thread count.


Time modeling enables slack dispersion

Slack-dispersing DCT & DVFS [IPDPS10, TPDS13]

Use the critical-path time to determine slack (essentially imbalance):

$\Delta t_i^{slack} = t_c - t_i - t_i^{comm} - t_{dvfs}$

Time constraint: $\sum_{1 \le j \le M} \Delta t_{ijk} \le \Delta t_i^{slack}$

Energy constraint: $\sum_{1 \le j \le M} t_{ijk} f_k \le t_i f_0$
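A minimal sketch of how these constraints might be used, assuming (unlike the real [IPDPS10]/[TPDS13] algorithm) that phase times simply scale with f0/fk, i.e. that phases are CPU-bound; the phase times, slack and frequency list are hypothetical:

    #include <stdio.h>
    #include <stddef.h>

    /* Choose the lowest frequency f_k for MPI task i such that the extra time
     * of its M phases stays within the task's slack (the time constraint above). */
    double pick_frequency(const double *t_phase, size_t M,   /* t_ij at f0 */
                          double slack,                      /* Delta t_i^slack */
                          const double *freqs, size_t K,     /* candidates, ascending */
                          double f0)
    {
        double t_i = 0.0;
        for (size_t j = 0; j < M; j++)
            t_i += t_phase[j];

        for (size_t k = 0; k < K; k++) {
            /* Predicted extra time if every phase runs at freqs[k] instead of f0,
             * under the crude assumption t_ijk = t_ij * f0 / f_k. */
            double delta = t_i * (f0 / freqs[k]) - t_i;
            if (delta <= slack)
                return freqs[k];   /* lowest frequency that still fits the slack */
        }
        return f0;                 /* no usable slack: stay at the nominal frequency */
    }

    int main(void)
    {
        double phases[] = { 40.0, 25.0, 35.0 };         /* ms at f0, hypothetical */
        double freqs[]  = { 1.6, 2.0, 2.4, 2.8, 3.2 };  /* GHz, ascending */
        printf("run this task at %.1f GHz\n",
               pick_frequency(phases, 3, 30.0, freqs, 5, 3.2));
        return 0;
    }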


Performance Evaluation

Consistent (or improving) energy savings under both weak and strong scaling


Have we solved the problem?


How much parallelism is (really) there?

[Figure: task graph of tile Cholesky factorization (5×5 tiles), built from xPOTRF, xTRSM, xSYRK and xGEMM tasks. © Ltaief et al., LAPACK Note #223]

[Figure: dynamic pipeline structure of the dedup benchmark: nested pipelines vs. hyperqueue placement, with a sketch of the hyperqueue implementation (Figure 10 of the SC13 hyperqueue paper)]


Emerging scale-free programming models

• Annotate task memory footprints and side-effects
  – input (read-only), inout (read-write), output (write-only)

• Discover task dependences at run-time
  – dynamically extract task parallelism
  – schedule tasks out-of-order
  – e.g. the "depend" clause in OpenMP 4.0 RC2 (March 2013); see the sketch below
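A small, self-contained illustration of this style using the OpenMP "depend" clause mentioned above (the arrays and sizes are made up): the runtime discovers that the two producer tasks are independent and may run in parallel, while the consumer is scheduled only after both complete.

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static double a[N], b[N], c[N];

        #pragma omp parallel
        #pragma omp single
        {
            /* Producer: writes a; declared as an output dependence. */
            #pragma omp task depend(out: a)
            for (int i = 0; i < N; i++) a[i] = i;

            /* Producer: writes b; independent of the task above, so the
             * runtime is free to schedule the two producers in parallel. */
            #pragma omp task depend(out: b)
            for (int i = 0; i < N; i++) b[i] = 2.0 * i;

            /* Consumer: reads a and b, writes c; runs only after both
             * producers complete. */
            #pragma omp task depend(in: a, b) depend(out: c)
            for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

            #pragma omp taskwait
        }
        printf("c[10] = %f\n", c[10]);
        return 0;
    }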

[Figure: task graph of tile Cholesky factorization (5×5 tiles), built from xPOTRF, xTRSM, xSYRK and xGEMM tasks. © Ltaief et al., LAPACK Note #223]


Better concurrency control saves energy


    struct data { ... };

    void consumer(popdep<data> queue) {
        while( !queue.empty() ) {
            data d = queue.pop();
            // ... operate on data ...
        }
    }

    void producer(pushdep<data> queue, int start, int end) {
        if( end - start <= 10 ) {
            for( int n = start; n < end; ++n ) {
                data d = f(n);
                queue.push(d);
            }
        } else {
            spawn producer(queue, start, (start + end) / 2);
            spawn producer(queue, (start + end) / 2, end);
            sync;
        }
    }

    void pipeline( int total ) {
        hyperqueue<data> queue;
        spawn producer((pushdep<data>)queue, 0, total);
        spawn consumer((popdep<data>)queue);
        sync;
    }

Figure 2: The simple pipeline-parallel program of Figure 1 expressed with the hyperqueue.


Hyperqueues express and control data-dependent parallelism in variable-rate pipelines [SC13]

[Figures: hyperqueue implementation of dedup (nested pipelines vs. hyperqueues) and dedup speedup with pthreads, TBB, objects and hyperqueues; the hyperqueue version outperforms pthreads by 12% to 30% around 6-8 threads (Figures 10 and 11 of the SC13 paper)]


Task dataflow and locality

• Rich semantic information is available to the compiler and runtime system
  – The DAG, program order for correctness and determinism, and task memory footprints for locality

• Opportunity to make the memory system aware of working sets
  – The runtime explicitly manages the memory hierarchy by placing task footprints in caches


Overlooking the memory hierarchy

[Figure: dynamic energy per instruction (EPI) while traversing the operational intensity (OI) of an L3-cache-sensitive workload, split into CPU and cache energy]

Observations:
1. OI counts L3 accesses instead of memory accesses.
2. L3 accesses also degrade energy efficiency at high OI.
3. The cache hierarchy consumes up to 50% of the total energy.

Ioannis Manousakis and Dimitrios S. Nikolopoulos, Measuring and Modeling Energy with BTL

[SBAC-PAD'12]


Overlooking the memory hierarchy

Table: EPI comparison of throughput-, latency- and L3-sensitive workloads.

Workload     OI     EPI (J/instruction)   Relative to L3
L3           High   3.9×10^-9             1
Throughput   High   1.18×10^-8            3.02
Latency      High   5.8×10^-8             14.9
L3           Low    2.4×10^-9             1
Throughput   Low    4.0×10^-9             1.6
Latency      Low    3.6×10^-8             15

Ioannis Manousakis and Dimitrios S. Nikolopoulos, Measuring and Modeling Energy with BTL


Cache management using task lifetimes

[Figure: epoch-based cache management (ECM): each cache line's tag is extended with an epoch, and the current epoch, current quota and next quota registers feed the replacement-decision logic]

• Epoch quotas: cache-space allocation per task (best-effort)
  – Software declares a quota from the task footprint size (ECM converts it to cache ways)
  – When the current and next epochs compete, guarantee a minimum allocation

• Replacement: computes current and next occupancy (per set); see the sketch below
  – Replace from the requesting epoch when the set is full (e.g. using the LRU bits)
  – Throttle EBP (prefetching) when the set is full and the epoch has exceeded its quota
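A rough software model of that replacement rule (a hypothetical sketch for illustration only, not the [ICS13] hardware design; the set layout, epoch encoding and LRU bookkeeping are all assumptions):

    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 16

    /* One cache set in a toy ECM model: each way is tagged with the epoch
     * (current = 0 or next = 1) of the task that installed the line. */
    struct ecm_set {
        uint8_t epoch[WAYS];
        uint8_t lru_rank[WAYS];   /* 0 = most recently used ... WAYS-1 = LRU */
        uint8_t valid[WAYS];
    };

    /* Pick a victim way for a fill requested by req_epoch. Following the rule
     * on this slide: when the set is full, replace from the requesting epoch,
     * choosing its least recently used way; fall back to plain LRU if the
     * requesting epoch holds nothing in this set. */
    int ecm_pick_victim(const struct ecm_set *s, uint8_t req_epoch)
    {
        int victim = -1, worst = -1;

        for (int w = 0; w < WAYS; w++)
            if (!s->valid[w])
                return w;                 /* free way: no replacement needed */

        for (int w = 0; w < WAYS; w++)
            if (s->epoch[w] == req_epoch && s->lru_rank[w] > worst) {
                worst = s->lru_rank[w];
                victim = w;
            }

        if (victim < 0)
            for (int w = 0; w < WAYS; w++)
                if (s->lru_rank[w] > worst) {
                    worst = s->lru_rank[w];
                    victim = w;
                }
        return victim;
    }

    int main(void)
    {
        struct ecm_set s = { {0}, {0}, {0} };
        for (int w = 0; w < WAYS; w++) {
            s.valid[w] = 1;
            s.epoch[w] = (uint8_t)(w < 4);   /* ways 0-3 held by the next epoch */
            s.lru_rank[w] = (uint8_t)w;
        }
        printf("victim way for the current epoch: %d\n", ecm_pick_victim(&s, 0));
        return 0;
    }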

[ICS13]  


Better locality cuts down energy consumption

Jacobi and Sparse-LU are memory-intensive codes with medium or low OI; energy savings of 20%-30%.


Should the programmer care about energy-efficiency?

Joe the Plumber, Green Programmer


Energy and the programmer

• Programmers must go back to half-a-century-old principles
  – Eliminate waste!
  – Much of the programming we already do does this:
    • Load balancing
    • Removing communication or synchronisation

• Scale-free programming models can help programmers achieve this
  – The programmer expresses exact parallelism and locality patterns
  – The runtime system maps them to cores, memories and the interconnect so as to avoid waste

• Domain-specific knowledge can further save energy


The "Lernaia Hydra"

• Power instrumentation is inaccurate, intrusive, coarse-grained, …
  – Software is at the mercy of hardware (PMCs, sensors, voltage regulators; everything is machine-specific, …)

• No software standards for power
  – How would power knobs make it into MPI, OpenMP, Cilk, PGAS, or even mainstream languages?

• What if a power cap is imposed?
  – And violated?

• Riding the technology curve is dangerous
  – Low voltage may become sub-threshold voltage
  – Sub-threshold voltage will increase the soft-error rate
  – Soft errors will cause failures


Looking forward: the SCoRPiO project

• Computing at the limits of energy and reliability

• Embrace uncertainty!
  – Not all bits in memory and registers are equally critical
  – Application-specific quality control

• Minimise power by scaling gracefully under hardware errors
  – Scale-free parallel programming


Acknowledgments


Our resources


More information

http://www.qub.ac.uk/research-centres/HPDC/


BlueGene on the Green500

[Figure: "BlueGene and the Green-500 List": MFLOPS per Watt of the BlueGene/L, /P and /Q systems versus the Green-500 peak, from Nov-07 to Nov-12 (scale 0 to 3000 MFLOPS/W)]