1
OutofOrder Parallel Discrete Event Simula7on for Transac7on Level Models Weiwei Chen Advisor: Prof. Rainer Dömer Center for Embedded Computer Systems University of California, Irvine Center for Embedded Computer Systems © 2013 Weiwei Chen, Version 1.2 March, 2013. Selec7on of Relevant Publica7ons W. Chen, R. Dömer, "ConcurrenC: A new approach towards effec3ve abstrac3on of Cbased SLDLs”, IESS 2009. W. Chen, R. Dömer, " A Fast Heuris3c Scheduling Algorithm for Periodic ConcurrenC Models", ASPDAC 2010. W. Chen, X. Han, R. Dömer, “ESL Design and Mul3Core Valida3on using the SystemonChip Environment”, HLDVT 2010 R. Dömer, W. Chen, X. Han, Andreas Gerstlauer, “Mul3Core Parallel Simula3on of SystemLevel Descrip3on Languages”, ASPDAC 2011. W. Chen, X. Han, R. Dömer, “Mul3core Simula3on of Transac3on Level Models using the SystemonChip Environment”, IEEE Design & Test of Computers, MayJune 2011. Tradi7onal DE simula7on Synchronous PDES Outoforder PDES SimulaUon Time One global Ume tuple (t, δ) shared by every thread and event Local Ume for each thread th as tuple (tthth). A total order of Ume is defined with the following relaUons: equal: (t11) = (t22), iff t1 =t21 2 before: (t11) < (t22), iff t1 <t2, or t1 =t21 2 aEer: (t11) > (t22), iff t1 >t2, or t1 =t21 2 Event DescripUon Events are idenUfied by their ids, i.e, event (id). A Umestamp is added to idenUfy every event, i.e. event (id, t, δ). SimulaUon Thread Sets READY, RUN, WAIT, WAITFOR, JOINING, COMPLETE Threads are organized as subsets with the same Umestamp (tthth). Thread sets are the union of these subsets, i.e, READY = READYt,δ, RUN = RUNt,δ , WAIT = WAITt,δ , WAITFOR = WAITFORt,δ (δ = 0), where the subsets are ordered in increasing order of Ume (t, δ) Threading Model Userlevel or OS kernellevel OS kernellevel Run Time Scheduling Event delivery inorder in deltacycle loop Time advance inorder in outer loop Event delivery outoforder if no conflicts exist Time advance outoforder if no conflicts exist Only one thread is acUve at one Ume No parallelism No SMP u5liza5on Threads at same simulaUon cycle run in parallel Limited parallelism Inefficient SMP u5liza5on Threads at same simulaUon cycle or with no conflicts run in parallel More parallelism Efficient SMP u5liza5on Compile Time Analysis No synchronizaUon protecUon needed Need synchroniza5on protec5on for shared resources, e.g. any userdefined and hierarchical channels, data structures in the scheduler No conflict analysis needed StaUc conflict analysis derives Segment Graph (SG) from the Control Flow Graph (CFG), analyzes variable and event accesses, passes conflict tables to the scheduler. Compile 7me increases SpecC Models SystemC Models SpecC SLDLs SLDLs MoCs MoCs Abstrac7on Descrip7ve Capability Rela7onship between ConcurrenC and Cbased SLDLs SystemC ConcurrenC Source: www.apple.com Source: www.aa1car.com Embedded System Design Embedded systems are special purposed computer systems with a wide applicaUon domain, e.g. automobiles, medical devices, communicaUon systems, etc. TransacUon Level Modeling (TLM) models embedded systems at different abstracUon levels. TLMs are usually described in SystemLevel DescripUon Languages (SLDLs), e.g. SystemC and SpecC. TLMs are typically validated by Discrete Event (DE) simulaUon. Related Work: Accelerate TLM simula7on Modeling Techniques TransacUonlevel modeling (TLM) TLM temporal decoupling Sourcelevel SimulaUon Stagelmann et al. [DAC’11] HostCompiled SimulaUon Gerstlauer et al. [RSP’10] Distributed Simula7on Chandy et al. [TSE’79] Huang et al. [SIES’08] Chen et al. [CECS’11] Hardwarebased Accelera7on Sirowy et al. [DAC’10] Nanjundappa et al. [ASPDAC’10] Sinha et al. [ASPDAC’12] Vinco et al. [DAC’12] SMP Parallel Simula7on Fujimoto. [CACM’90] Chopard et al. [ICCS’06] Ezudheen et al. [PADS’09] Mello et al. [DATE’10] Schumacher et al. [CODES’11] Chen et al. [IEEED&T’11] Yun et al. [TCAD’12] Source: P. Chou, UCI – Executes threads in the same simulaUon cycles (3me, delta) in parallel – Needed protecUon of communicaUon and synchronizaUon can be automaUcally instrumented start READY == ? thWAIT, if ths event is noUfied; Move(th, WAIT, READY), Clear noUfied events; READY == ? Update the simulaUon Ume; move the earliest thWAITFOR to READY; READY == ? end No Yes No No Yes Yes th = Pick(READY, RUN); Go(th); RUN == ? |RUN| <= #CPUs && READY != ? Yes No sleep sleep No Yes deltacycle 5medcycle Image courtesy of Intel Synchronous Parallel Discrete Event Simula7on (SPDES) System-level Models B0 B1 B2 B3 • Structured model • Computation and Communication separated Explicit parallelism • Synthesizable and analyzable Multi-core CPUs readily available • Symmetric multiprocessing (SMP) • Hyper-threading support • Many-core CPUs coming Sta5c States in SPDES READY WAIT RUN WAITFOR thread is issued waits for event e event e is no7fied waits for 7me t Simula7on 7me is advanced to t #core 4 3 1 3 2 1 4 10:0 10:1 10:2 20:1 20:0 20:2 30:0 0:0 T:Δ th 4 th 2 th 3 th 1 10:0 10:1 10:2 20:1 20:0 20:2 30:0 0:0 T:Δ th 4 th 2 th 3 th 1 R. Dömer, W. Chen, X. Han, “Parallel Discrete Event Simula3on of Transac3on Level Models”, ASPDAC 2012. W. Chen, R. Dömer, “An Op3mizing Compiler for OutofOrder Parallel ESL Simula3on Exploi3ng Instance Isola3on” ASPDAC 2012. W. Chen, X. Han, R. Dömer, “Outoforder Parallel Simula3on for ESL Design”, DATE 2012. W. Chen, X. Han, C. Chang, R. Dömer, “Elimina3ng Race Condi3ons in SystemLevel Models by using Parallel Simula3on Infrastructure”, HLDVT 2012. W. Chen, X. Han, C. Chang, R. Dömer, “Advances in Parallel Discrete Event Simula3on for Electronic SystemLevel Design”, IEEE Design & Test of Computers, JanuaryFebruary 2013. W. Chen, R. Dömer, “Op3mized OutofOrder Parallel Discrete Event Simula3on Using Predic3ons” DATE 2013. thREADY, if( RUN.size < numCPUs && NoConflict(th)), Go(th); start Simula5on States Update thWAIT, if ths event is noUfied @ (tee); Move(th, WAITtth,δth, READYtee+1) thWAITFOR, Move(th, WAITFORtth,δth, READYtth,0) RUN == ? end thREADY, if( RUN.size < numCPUs && NoConflict(th)), Go(th); sleep; No Yes 1: #include <stdio.h> 2: int array[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; 3: behavior B(in int begin, in int end, out int sum) 4: { int i; 5: void main(){ 6: int tmp; tmp = 0; i = begin; 7: waikor 1; // v3, segment 3 starts 8: while(i <= end){ 9: waikor 2; // v4, segment 4 starts 10: tmp += array[i]; i ++; 11: } 12: sum = tmp; } 13: }; 14: behavior Main() 15: {int sum1, sum2; 16: B b1(0, 4, sum1); B b2(5, 9, sum2); 17: int main(){ 18: par{ // v1, segment 1 starts 19: b1.main(); 20: b2.main(); 21: } // v2, segment 2 starts 22: prink("summa7on 1 is :%d \n", sum1); 23: prink("summa7on 2 is :%d \n", sum2); }}; Outoforder Parallel Discrete Event Simula7on (OoO PDES) – Breaks simulaUon cycle barriers by localizing the Ume for each thread – Aggressively issues threads to run in parallel even in different cycles – Preserves the simulaUon correctness and Uming accuracy waiHor 1 (Main.b1) seg3 seg4 thread_b1 par par end seg0 thread_Main seg2 thread_Main seg1 waiHor 1 (Main.b2) seg1 seg5 seg6 thread_b2 waiHor 2 (Main.b1) waiHor 2 (Main.b2) seg 0 1 2 3 4 5 6 0 F F F F F F F 1 F T F T T T T 2 F F F T T T T 3 F T T T T F F 4 F T T T T F F 5 F T T F F T T 6 F T T F F T T • StaUc Conflict Analysis – Compiler builds Segment Graph (SG) derived from Control Flow Graph (CFG) of applicaUons – Compiler builds Segment Conflict Tables for quick lookup at runUme Example Segment Graph • OoO PDES OpUmizaUons Instance isola5on reduces false conflicts [ASPDAC’12] – Scheduling predic5on for more parallelism [DATE’13] • StaUc code analysis applicaUons – Conflicts detecUon for recoding – Debugging when parallelizing applicaUons [HLDVT’12] Segment Data Conflict Table 10:0 10:1 10:2 20:1 20:0 20:2 30:0 0:0 T:Δ th 4 th 2 th 3 th 1 #core 4 4 4 4 4 3 1 th 4 th 2 th 3 th 1 shortend sim run 7me Dynamic States in OoO PDES READYt,δ WAITt,δ RUNt,δ READYt,δ+1 READYt+d1,0 WAITFORt+d2,0 READYt+d1,1 WAITt,δ+1 WAITt+d1,0 RUNt,δ+1 RUNt+d1,0 RUNt+d1,1 WAITFORt+d1,0 0 10 20 30 40 50 0 200 400 600 800 1000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 OutofOrder Parallel Simula7on with Predic7ons (H.264 encoder) sim run7me (sec) comp 7me(sec) Experiments and Results Parallel Fibonacci calculaUon with Uming informaUon Parallel JPEG image encoder with 3 color components encoded in parallel, a sequenUal Huffman encoding Parallel H.264 video decoder with 4 slice decoders and a sequenUal slice reader and synchronizer Host: 64bit Fedora 12 Linux with 2 6core CPUs (Intel(R) Xeon(R) X5650) at 2.67 GHz with 2 hyperthreads per core (supports 24 threads in parallel) 0 0.5 1 1.5 2 2.5 3 spec arch sched net Simula7on Run7me for the JPEG encoder model [sec] Tradi7onal sequen7al Synchronous PDES OoO PDES 0 0.5 1 1.5 2 2.5 3 3.5 spec arch sched net Compila7on 7me for the JPEG encoder model [sec] Tradi7onal sequen7al Synchronous PDES OoO PDES 0 50 100 150 200 spec arch sched net Simula7on run7me for the H.264 decoder model [sec] Tradi7onal sequen7al Synchronous PDES OoO PDES 10 10.5 11 11.5 12 12.5 13 13.5 spec arch sched net Compila7on 7me for the H.264 decoder model [sec] Tradi7onal sequen7al Synchronous PDES OoO PDES 0 2 4 6 8 10 12 14 0 5 10 15 20 25 30 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Parallel Fibonacci calculation with timing information fibo_timed elapsed time (SPDES) [sec] fibo_timed elapsed time (OoO PDES) [sec] fibo_timed rel. speedup (SPDES) fibo_timed rel. speedup (OoO PDES)

Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models* …weiweic/publications/OoOPDES_Poster.pdf · 2014. 3. 3. · Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models*

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models* …weiweic/publications/OoOPDES_Poster.pdf · 2014. 3. 3. · Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models*

Out-­‐of-­‐Order  Parallel  Discrete  Event  Simula7on  for  Transac7on  Level  Models  Weiwei  Chen  

Advisor:  Prof.  Rainer  Dömer  Center  for  Embedded  Computer  Systems  

University  of  California,  Irvine  

Center  for  Embedded  Computer  Systems  

© 2013 Weiwei Chen, Version 1.2 March, 2013.

v  Selec7on  of  Relevant  Publica7ons  •  W.  Chen,  R.  Dömer,  "ConcurrenC:  A  new  approach  towards  effec3ve  abstrac3on  of  C-­‐based  SLDLs”,  IESS  2009.    •  W.  Chen,  R.  Dömer,  "  A  Fast  Heuris3c  Scheduling  Algorithm  for  Periodic  ConcurrenC  Models",  ASP-­‐DAC  2010.  •  W.  Chen,  X.  Han,  R.  Dömer,  “ESL  Design  and  Mul3-­‐Core  Valida3on  using  the  System-­‐on-­‐Chip  Environment”,  HLDVT  2010    •  R.  Dömer,  W.  Chen,  X.  Han,  Andreas  Gerstlauer,    “Mul3-­‐Core  Parallel  Simula3on  of  System-­‐Level  Descrip3on  Languages”,  ASP-­‐DAC  2011.  •  W.  Chen,  X.  Han,  R.  Dömer,  “Mul3-­‐core  Simula3on  of  Transac3on  Level  Models  using  the  System-­‐on-­‐Chip  Environment”,    

IEEE  Design  &  Test  of  Computers,  May-­‐June  2011.  

Tradi7onal  DE  simula7on     Synchronous  PDES     Out-­‐of-­‐order  PDES  

SimulaUon  Time     One  global  Ume  tuple  (t,  δ)    shared  by  every  thread  and  event    

Local  Ume  for  each  thread  th  as  tuple  (tth,δth).  A   total   order   of   Ume   is   defined   with   the   following   relaUons:  equal:  (t1,  δ1)  =  (t2,  δ2),  iff  t1  =  t2,  δ1  =  δ2  before:  (t1,δ1)  <  (t2,δ2),  iff  t1  <  t2,  or  t1  =  t2,  δ1  <  δ2    aEer:  (t1,δ1)  >  (t2,δ2),  iff  t1  >  t2,  or  t1  =  t2,  δ1  >  δ2    

Event  DescripUon     Events  are  idenUfied  by  their  ids,  i.e,  event  (id).     A  Umestamp  is  added  to  idenUfy  every  event,  i.e.  event  (id,  t,  δ).    

SimulaUon  Thread  Sets    READY,  RUN,  WAIT,  WAITFOR,    JOINING,  COMPLETE    

Threads  are  organized  as  subsets  with  the  same  Umestamp    (tth,  δth).  Thread  sets  are  the  union  of  these  subsets,    i.e,  READY  =  ∪READYt,δ,  RUN  =  ∪RUNt,δ  ,  WAIT  =  ∪WAITt,δ  ,  WAITFOR  =  ∪WAITFORt,δ  (δ  =  0),  where    the  subsets  are  ordered  in  increasing  order  of  Ume  (t,  δ)  

Threading  Model     User-­‐level  or  OS  kernel-­‐level     OS  kernel-­‐level    

Run  Time  Scheduling    

Event  delivery  in-­‐order  in  delta-­‐cycle  loop    Time  advance  in-­‐order  in  outer  loop    

Event  delivery  out-­‐of-­‐order  if  no  conflicts  exist    Time  advance  out-­‐of-­‐order  if  no  conflicts  exist    

Only  one  thread  is    acUve  at  one  Ume    No  parallelism  No  SMP  u5liza5on    

Threads   at   same   simulaUon  cycle  run  in  parallel    Limited  parallelism    Inefficient  SMP  u5liza5on  

Threads  at  same  simulaUon  cycle    or  with  no  conflicts  run  in  parallel    More  parallelism  Efficient  SMP  u5liza5on    

Compile  Time  Analysis    

No  synchronizaUon    protecUon  needed    

Need  synchroniza5on  protec5on  for  shared  resources,  e.g.  any  user-­‐defined  and  hierarchical  channels,  data  structures  in  the  scheduler    

No  conflict  analysis  needed  StaUc   conflict   analysis   derives   Segment   Graph   (SG)   from   the  Control  Flow  Graph  (CFG),  analyzes  variable  and  event  accesses,  passes  conflict  tables  to  the  scheduler.  Compile  7me  increases    

SpecC  Models  

SystemC  Models  

SpecC  SLDLs   SLDLs  

MoCs  MoCs  

Abstrac7on   Descrip7ve  Capability  

Rela7onship  between    ConcurrenC  and  C-­‐based  SLDLs  

SystemC  

ConcurrenC  

Source: www.apple.com Source: www.aa1car.com

v Embedded  System  Design    •  Embedded  systems  are  special  purposed  computer  systems    

with  a  wide  applicaUon  domain,  e.g.  automobiles,  medical  devices,  communicaUon  systems,  etc.    

•  TransacUon  Level  Modeling  (TLM)  models    embedded  systems  at  different  abstracUon  levels.  

•  TLMs  are  usually  described  in  System-­‐Level  DescripUon  Languages  (SLDLs),  e.g.  SystemC  and  SpecC.  

•  TLMs  are  typically  validated  by  Discrete  Event  (DE)    simulaUon.  

v Related  Work:  Accelerate  TLM  simula7on  Modeling  Techniques  

•  TransacUon-­‐level  modeling  (TLM)  •  TLM  temporal  decoupling  •  Source-­‐level  SimulaUon              Stagelmann  et  al.  [DAC’11]  •  Host-­‐Compiled  SimulaUon              Gerstlauer  et  al.  [RSP’10]  

Distributed  Simula7on  •  Chandy  et  al.  [TSE’79]  •  Huang  et  al.  [SIES’08]  •  Chen  et  al.  [CECS’11]  

Hardware-­‐based  Accelera7on  •  Sirowy  et  al.  [DAC’10]  •  Nanjundappa  et  al.  [ASPDAC’10]  •  Sinha  et  al.  [ASPDAC’12]    •  Vinco  et  al.  [DAC’12]  

SMP  Parallel  Simula7on  •  Fujimoto.  [CACM’90]      •  Chopard  et  al.  [ICCS’06]    •  Ezudheen  et  al.  [PADS’09]      •  Mello  et  al.  [DATE’10]  •  Schumacher  et  al.  [CODES’11]  •  Chen  et  al.  [IEEED&T’11]      •  Yun  et  al.  [TCAD’12]    

Source: P. Chou, UCI

–  Executes  threads  in  the  same  simulaUon  cycles  (3me,  delta)  in  parallel  –  Needed  protecUon  of  communicaUon  and  synchronizaUon    can  be  automaUcally  instrumented  

start  

READY  ==  ∅  ?  

∀th∈WAIT,  if  th’s  event  is  noUfied;  Move(th,  WAIT,  READY),  Clear  noUfied  events;    

READY  ==  ∅  ?  

Update  the  simulaUon  Ume;  move  the  earliest  th∈WAITFOR  to  READY;    

READY  ==  ∅  ?  

end  

No  

Yes  

No  

No  

Yes  

Yes  

th  =  Pick(READY,  RUN);  Go(th);  

RUN  ==  ∅  ?  |RUN|  <=  #CPUs    &&  READY  !=  ∅  ?  

Yes  

No  sleep  

sleep  

No  

Yes  

delta-­‐cycle  

5med-­‐cycle  

Image courtesy of Intel

v Synchronous  Parallel  Discrete  Event  Simula7on  (SPDES)  System-level Models

B0   B1  

B2   B3  

• Structured model • Computation and Communication separated • Explicit parallelism • Synthesizable and analyzable

• Multi-core CPUs readily available • Symmetric multiprocessing (SMP) • Hyper-threading support • Many-core CPUs coming

Sta5c  States  in  SPDES  

READY  

WAIT  

RUN  

WAITFOR  

thread  is    issued  

waits  for    event  e  

event  e    is  no7fied  

waits  for    7me  t  

Simula7on    7me  is  

advanced    to  t  

#core  

4  

3  

1  

3  

2  1  

4  

10:0  10:1  

10:2  

20:1  20:0  

20:2  

30:0  

0:0  T:Δ  th4  th2   th3  th1  

10:0  10:1  

10:2  

20:1  20:0  

20:2  

30:0  

0:0  T:Δ  th4  th2   th3  th1  

•  R.  Dömer,    W.  Chen,  X.  Han,  “Parallel  Discrete  Event  Simula3on  of  Transac3on  Level  Models”,  ASP-­‐DAC  2012.  •  W.  Chen,  R.  Dömer,    “An  Op3mizing  Compiler  for  Out-­‐of-­‐Order  Parallel  ESL  Simula3on  Exploi3ng  Instance  Isola3on”  ASP-­‐DAC  2012.  •  W.  Chen,  X.  Han,  R.  Dömer,    “Out-­‐of-­‐order  Parallel  Simula3on  for  ESL  Design”,  DATE  2012.  •  W.  Chen,  X.  Han,  C.  Chang,  R.  Dömer,  “Elimina3ng  Race  Condi3ons  in  System-­‐Level  Models  by  using  Parallel  Simula3on  Infrastructure”,  HLDVT  2012.  •  W.  Chen,  X.  Han,  C.  Chang,  R.  Dömer,  “Advances  in  Parallel  Discrete  Event  Simula3on  for  Electronic  System-­‐Level  Design”,    

IEEE  Design  &  Test  of  Computers,  January-­‐February  2013.  •  W.  Chen,  R.  Dömer,    “Op3mized  Out-­‐of-­‐Order  Parallel  Discrete  Event  Simula3on  Using  Predic3ons”  DATE  2013.  

∀th∈READY,    if(  RUN.size  <  numCPUs  &&  NoConflict(th)),  Go(th);    

start

Simula5on  States  Update    

∀th∈WAIT,  if  th’s  event  is  noUfied  @  (te,  δe);  Move(th,  WAITtth,δth,  READYte,  δe+1)  

∀th∈WAITFOR,  Move(th,  WAITFORtth,δth,  READYtth,  0)  

RUN  ==  ∅  ?  

end

∀th∈READY,    if(  RUN.size  <  numCPUs  &&  NoConflict(th)),  Go(th);    

sleep;  No

Yes

 1:  #include  <stdio.h>    2:  int  array[]  =  {0,  1,  2,  3,  4,  5,  6,  7,  8,  9};    3:  behavior  B(in  int  begin,  in  int  end,  out  int  sum)    4:  {  int  i;                                  5:      void  main(){    6:        int  tmp;  tmp  =  0;  i  =  begin;      7:      waikor  1;      //  v3,  segment  3  starts        8:        while(i  <=  end){            9:            waikor  2;  //  v4,  segment  4  starts      10:            tmp  +=  array[i];  i  ++;      11:      }  12:        sum  =  tmp;  }  13:  };  14:  behavior  Main()  15:  {int  sum1,  sum2;    16:    B  b1(0,  4,  sum1);  B  b2(5,  9,  sum2);  17:    int  main(){  18:        par{  //  v1,  segment  1  starts  19:            b1.main();  20:            b2.main();    21:          }  //  v2,  segment  2  starts    22:        prink("summa7on  1  is  :%d  \n",  sum1);        23:        prink("summa7on  2  is  :%d  \n",  sum2);  }};  

v Out-­‐of-­‐order  Parallel  Discrete  Event  Simula7on  (OoO  PDES)  –  Breaks  simulaUon  cycle  barriers  by  localizing  the  Ume  for  each  thread  –  Aggressively  issues  threads  to  run  in  parallel  even  in  different  cycles  –  Preserves  the  simulaUon  correctness  and  Uming  accuracy  

waiHor  1    (Main.b1)  

seg3    

seg4  

thread_b1   par  

par  end  

seg0  thread_Main  

seg2  thread_Main  

seg1  

waiHor  1    (Main.b2)  

seg1  

seg5    

seg6  

thread_b2  

waiHor  2    (Main.b1)   waiHor  2    (Main.b2)  

seg   0   1   2   3   4   5   6  

0   F   F   F   F   F   F   F  

1   F   T   F   T   T   T   T  

2   F   F   F   T   T   T   T  

3   F   T   T   T   T   F   F  

4   F   T   T   T   T   F   F  

5   F   T   T   F   F   T   T  

6   F   T   T   F   F   T   T  

•  StaUc  Conflict  Analysis    –  Compiler  builds  Segment  Graph  (SG)  derived  from    Control  Flow  Graph  (CFG)  of  applicaUons  

–  Compiler  builds  Segment  Conflict  Tables  for  quick  look-­‐up  at  runUme  

Example   Segment  Graph  

•  OoO  PDES  OpUmizaUons    –  Instance  isola5on  reduces  false  conflicts  [ASPDAC’12]  –  Scheduling  predic5on  for  more  parallelism  [DATE’13]    

•  StaUc  code  analysis  applicaUons    –  Conflicts  detecUon  for  recoding    –  Debugging  when  parallelizing  applicaUons  [HLDVT’12]  

Segment  Data  Conflict  Table  

10:0  10:1  

10:2  

20:1  20:0  

20:2  

30:0  

0:0  T:Δ  th4  th2   th3  th1   #core  

4  

4  

4  

4  

4  3  

1  

th4  th2   th3  th1  

shortend  sim  run  7me  

Dynamic  States  in  OoO  PDES  

READYt,δ  

WAITt,δ  

RUNt,δ  

READYt,δ+1  

READYt+d1,0  

…   …   …  

WAITFORt+d2,0  

READYt+d1,1  

WAITt,δ+1  

WAITt+d1,0  

RUNt,δ+1  

RUNt+d1,0  

RUNt+d1,1  

WAITFORt+d1,0  

0  

10  

20  

30  

40  

50  

0  

200  

400  

600  

800  

1000  

1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  

Out-­‐of-­‐Order  Parallel  Simula7on  with  Predic7ons  (H.264  encoder)  

sim  run7me  (sec)   comp  7me(sec)  

v Experiments  and  Results  •  Parallel  Fibonacci  calculaUon  with  Uming  informaUon  •  Parallel  JPEG  image  encoder  with  3  color  components  encoded  in  parallel,  a  sequenUal  Huffman  encoding  •  Parallel  H.264  video  decoder  with  4  slice  decoders  and  a  sequenUal  slice  reader  and  synchronizer  Ø  Host:  64-­‐bit  Fedora  12  Linux  with  2  6-­‐core  CPUs  (Intel(R)  Xeon(R)  X5650)  at  2.67  GHz  with  2  hyper-­‐threads  per  

core  (supports  24  threads  in  parallel)  

0  

0.5  

1  

1.5  

2  

2.5  

3  

spec   arch   sched   net  

Simula7on  Run7me  for  the  JPEG  encoder  model  [sec]  

Tradi7onal  sequen7al     Synchronous  PDES   OoO  PDES  

0  

0.5  

1  

1.5  

2  

2.5  

3  

3.5  

spec   arch   sched   net  

Compila7on  7me  for  the  JPEG  encoder  model  [sec]  

Tradi7onal  sequen7al     Synchronous  PDES   OoO  PDES  

0  

50  

100  

150  

200  

spec   arch   sched   net  

Simula7on  run7me  for  the  H.264  decoder  model  [sec]  

Tradi7onal  sequen7al     Synchronous  PDES   OoO  PDES  

10  

10.5  

11  

11.5  

12  

12.5  

13  

13.5  

spec   arch   sched   net  

Compila7on  7me  for  the  H.264  decoder  model  [sec]  

Tradi7onal  sequen7al     Synchronous  PDES   OoO  PDES  

0

2

4

6

8

10

12

14

0

5

10

15

20

25

30

35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Parallel Fibonacci calculation with timing information fibo_timed elapsed time (SPDES) [sec] fibo_timed elapsed time (OoO PDES) [sec] fibo_timed rel. speedup (SPDES) fibo_timed rel. speedup (OoO PDES)