28
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2016-0933PE Kokkos – Portability, Performance, Produc7vity Chris7an Tro<, Carter Edwards, Nathan Ellingwood, Si Hammond crtro<@sandia.gov Center for Compu7ng Research Sandia Na7onal Laboratories, NM

Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2016-0933PE

Kokkos  –  Portability,  Performance,  Produc7vity  Chris7an  Tro<,  Carter  Edwards,  Nathan  Ellingwood,  Si  Hammond  

crtro<@sandia.gov  Center  for  Compu7ng  Research  

Sandia  Na7onal  Laboratories,  NM  

Page 2: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Kokkos:  Performance,  Portability  and  Produc3vity  

2  

DDR#

HBM#

DDR#

HBM#

DDR#DDR#

DDR#

HBM#HBM#

Kokkos#

LAMMPS# Sierra# Albany#Trilinos#

Page 3: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Kokkos:  Performance,  Portability  and  Produc3vity  

§  A  programming  model  implemented  as  a  C++  library  §  Abstrac7ons  for  Parallel  Execu7on  and  Data  Management  

§  Execu7on  Pa<ern:  What  kind  of  opera7on  (for-­‐each,  reduc7on,  scan,  task)  

§  Execu7on  Policy:  How  to  execute  (Range  Policy,  Team  Policy,  DAG)  §  Execu7on  Space:  Where  to  execute  (GPU,  Host  Threads,  PIM)  §  Memory  Layout:  How  to  map  indicies  to  storage  (Column/Row  Major)  §  Memory  Traits:  How  to  access  the  data  (Random,  Stream,  Atomic)  §  Memory  Space:  Where  does  the  data  live  (High  Bandwidth,  DDR,  NV)  

§  Supports  mul7ple  backends:  OpenMP,  Pthreads,  Cuda,  Qthreads,  Kalmar  (experimental)  

§  Sandia  applica7on  teams  commi<ed  to  Kokkos  as  its  path  for  transi7oning  legacy  codes,  and  as  part  of  its  new  codes  §  Trilinos,  LAMMPS,  Albany,  Sierra  Mechanics,  …    

3  

Page 4: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

ASC  L2  2015  Codesign  Milestone  §  Evaluate  the  performance  and  produc7vity  tradeoff  when  using  the  Kokkos  C++  

programming  model  §  Chose  LLNL  mini-­‐App  Lulesh  to  demonstrate  broad  applicability  §  Report:  h<p://prod.sandia.gov/techlib/access-­‐control.cgi/2015/157886.pdf  

0

500

1000

1500

Kokkos Minimal CPU

Kokkos Minimal GPU

Kokkos Optimized v1

Kokkos Optimized v2

Kokkos Optimized v3

OpenMP Optimized

OpenMP Original

Sour

ce C

ode

Line

s

Source Code Lines Added/Removed Compared to Serial Lines Added

Lines Removed

0

5000

10000

15000

20000

Haswell Single Socket MPI 1 x 16 Threads (Problem

90)

Haswell Single Socket MPI 1 x 32

(Problem 90)

Knights Corner MPI 1 x 224 (Problem 90)

Sandy Bridge Single Socket MPI 1 x 8

(Problem 90)

NVIDIA K40 (Problem 90)

APM XGene1 MPI 1 x 8 Threads (Problem

90)

POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem

90/Rank)

POWER8-XL Single NUMA Domain MPI 1

x 40 Threads (Problem 90)

Zone

s pe

r Sec

ond

per R

ank

LULESH Benchmark Figure of Merit on Multi-Core, Many-Core and GPU Systems (Problem Size 90)

Original OpenMP OpenMP Optimized

Kokkos Minimal Kokkos Optimized v1

Kokkos Optimized v2 Kokkos Optimized v3

Page 5: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Going  Produc7on    §  Kokkos  released  on  github  in  March  2015  

§  Develop  /  Master  branch  system  =>  merge  requires  applica7on  passing  §  Tes7ng  Nightly:  11  Compilers,  total  of  90  backend  configura7ons,  warnings  as  errors    §  Extensive  Tutorials  and  Documenta7on  >  300  slides/pages  

§  Trilinos  NGP  stack  uses  Kokkos  as  only  backend  §  Tpetra,  Belos,  MueLu  etc.    §  Working  on  threading  all  kernels,  and  support  GPUs  

§  Sandia  Sierra  Mechanics  going  to  transi7on  to  Kokkos  §  Decided  to  go  with  Kokkos  instead  of  OpenMP  (only  other  realis7c  choice)  §  FY  2016:  prototyping  threaded  algorithms,  explore  code  pa<erns  §  Data  management  postponed  to  FY  2017  and  follow  on  

§  Sandia  ATDM  has  Kokkos  as  big  component  §  All  ATDM  Apps  are  using  Kokkos  §  Add  System  level  Tasking  with  Dharma  later  

5  

Page 6: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Improved  View  Capabili7es  §  View  now  allow  allocatable  types  

§  Enables  View-­‐of-­‐View  concept  (View<View<double*>  >)  §  Simplifies  implementa7on  for  enabling  UQ  types  

§  U7lize  const  expression  C++11  func7onality  §  Be<er  op7miza7on;  open  smaller  binary  size  

§  Improvement  in  underlying  assembly/instruc7on  genera7on  §  Makes  debugging  and  profiling  mapping  much  cleaner  

 

6  

Page 7: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

KokkosP  Profiling  Interface  §  Dynamic  Run7me  Linkable  profiling  tools  

§  Not  LD_PRELOAD  based  (horray!)  §  Profiling  hooks  are  always  enabled  (i.e.  also  in  release  builds)  

§  Compile  once,  run  any7me,  profile  any7me,  no  confusion  or  recompile!  

§  Tool  Chaining  allowed  (many  results  from  one  run)  §  Very  low  overhead  if  not  enabled  

§  Simple  C  Interface  for  Tool  Connectors  §  Users/Vendors  can  write  their  own  profiling  tools  §  VTune,  NSight  and  LLNL-­‐Caliper  

§  Parallel  Dispatch  can  be  named  to  improve  context  mapping  

§  Ini7al  tools:  simple  kernel  7ming,  memory  profiling,  thread  affinity  checker,  vectoriza7on  connector  (APEX-­‐ECLDRD)  

7  

Page 8: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Enhancing  Produc7vity:  Using  C++  Lambdas  §  C++11  Feature  which  simplify  using  abstrac7on  layers    

8  

Pragma Based OpenMP: !#pragma omp parallel for !for(int i=0; i<N; i++) { ! a[i] += b[i]; !} !!Functor Based Kokkos: !struct vector_add { ! View<double*> a; ! View<double*> b; ! vector_add(View<double*> a_, View<double*> b_): ! a(a_),b(b_){} ! KOKKOS_INLINE_FUNCTION ! void operator() (const int&i) const { ! a(i) += b(i); ! } !}; !!parallel_for( N, vector_add(a,b)); !!LAMBDA Based Kokkos: !parallel_for( N, KOKKOS_LAMBDA (const int& i) { ! a[i] += b[i]; !});

Page 9: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

9  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 10: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

10  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 11: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

11  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 12: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

12  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 13: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

13  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 14: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

14  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 15: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

15  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 16: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

16  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 17: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

17  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 18: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

18  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 19: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

19  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 20: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

20  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 21: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Hierarchical  Parallelism  §  Expose  more  parallelism  §  Provides  API  for  efficient  scratch  space  usage  and  collabora7on  

21  

TeamPolicy<> policy = TeamPolicy<>(N, AUTO()).set_scratch_size(1,team_size,thread_size); !parallel_for( policy, KOKKOS_LAMBDA (TeamPolicy<>::member_type& team) { ! View<double*,ScratchSpace> team_shared(team.team_scratch(1),M); ! View<double*,ScratchSpace> thr_private(team.thread_scratch(1),P); ! ! // Fill shared and thread private data! team.team_barrier(); ! ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j){ ! //Do Something! }); !! double sum; ! parallel_for(TeamThreadRange(team,0,M), [&] (int& j, double& lsum){ ! for(int k=0; k<P; k++) lsum+= thr_private(k); ! }); !! single(PerTeam(team), [&] () { global(team.league_rank()) = sum; } !});

Page 22: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Dynamic  Scheduling    §  Addresses  simple  load  balancing  issues  §  Affinity  aware  Work  Stealing  Mechanism  (note:  OpenMP  is  work  sharing)  §  Up  to  100x  faster  scheduling  than  Intel  OpenMP  (based  on  scheduling  stress  test)  §  Modifier  on  execu7on  policy  e.g.:  RangePolicy<Schedule<Dynamic> >(0,N) !

22  

Page 23: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Directed  Acyclic  Graph  (DAG)  of  Tasks  

23  

Page 24: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

§ A  Task  §  Is  a  C++  closure  (e.g.,  functor)  of  data  +  func7on  §  Executes  on  a  single  thread  or  thread  team  §  May  only  execute  when  its  dependences  are  complete  (DAG)  

§ A  Task’s  Life  Cycle:  

Directed  Acyclic  Graph  (DAG)  of  Tasks  

24  

construc6ng  

wai6ng  

execu6ng  

data  parallel  task  on  a  thread  team  

serial  task  on  a  single  thread  

complete  

Page 25: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Task  Execu7on  Policy  § Manages  a  Heterogeneous  Collec6on  of  Tasks  §  Memory  alloca7on  and  dealloca7on  in  a  memory  space  §  Execu7on  on  a  thread  or  thread  team  in  an  execu7on  space  §  Scheduling  according  to  dependence  directed  acyclic  graph  (DAG)  

§ Challenges  §  Portability  across  mul7core/manycore  architectures:  CPU,  Xeon  Phi,  GPU,  …  §  Dynamic  –  crea7ng  tasks  within  execu7ng  task  §  Performance  –  thread  scalable  alloca7on/dealloca7on  within  finite  memory  §  Performance  –  execu7on  overhead  and  thread  scalable  scheduling  

§ Portability  and  Performance  Constraint:  Non-­‐blocking  Tasks  §  Eliminate  overhead  of  saving  execu7on  state:  registers,  stack,  …  §  Reduce  overhead  of  context  switching  

25  

Page 26: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

Managing  a  Non-­‐Blocking  Task’s  life-­‐cycle  §  Create:  allocate  and  construct    §  By  the  main  process  or  another  task    §  Allocate  from  task  policy  memory  pool  §  Construct  internal  data  §  Assign  DAG  dependences  

§  Spawn:  enqueue  to  scheduler  §  By  the  main  process  or  another  task    

§  Respawn:  re-­‐enqueue  to  scheduler  §  By  the  execu7ng  task  itself  §  Reassign  DAG  dependences  §  Replaces  task  spawning  and  wai7ng  upon  “child”  task(s)  

§  Create  and  spawn  child  task(s)  §  Assign  new  DAG  dependence(s)  to  new  child  task(s)  §  Re-­‐enqueue  to  be  executed  again  aper  child  task(s)  complete  

26  

construc6ng  

wai6ng  

execu6ng  

complete  

create  

spawn  

respawn  

Page 27: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA

The  Way  Forward    

§  Stabilize  Capabili7es  §  Support  tasking  on  all  playorms  §  Make  sure  compilers  op7mize  through  layers  §  Harden  KNL  support  for  High  Bandwidth  Memory  

§  Support  Produc7on  Teams  in  Adop7on  §  Develop  more  Documenta7on    §  Extend  profiling  tools  to  help  with  transi7on    

27  

www.github.com/kokkos/kokkos: Kokkos Core Repository www.github.com/kokkos/kokkos-tutorials: Kokkos Tutorial Material www.github.com/kokkos/kokkos-profiling: Kokkos Profiling Tools (soon)

Page 28: Kokkos%–Portability,%Performance,%Produc7vity% · APM XGene1 MPI 1 x 8 Threads (Problem 90) POWER8-XL Dual Socket Node MPI 8 x 20 Threads (Problem 90/Rank) POWER8-XL Single NUMA