
DM-HEOM: A Case Study in Performance Portability

Matthias Noack ([email protected])

Zuse Institute Berlin, Distributed Algorithms and Supercomputing

2018-09-26, IXPUG Annual Fall Conference 2018

Motivation

Observations:
• many HPC applications are legacy code
  ⇒ written and incrementally improved by domain scientists
  ⇒ often without (modern) software engineering
• hardware lifetime: 5 years, with increasing variety
• software lifetime: multiple tens of years ⇒ outlives systems by far
  ⇒ need for code modernisation

What can we do better? – Our suggestion:
• portability first, then portable performance
• interdisciplinary collaborations where computer scientists design the software
• modern methods and technologies with a community beyond HPC


Contributions

a) 1st distributed-memory implementation of the HEOM method: DM-HEOM
b) Interdisciplinary development workflow
c) Design guidelines/experiences for performance-portable HPC applications

Paper: M. Noack, A. Reinefeld, T. Kramer, Th. Steinke: "DM-HEOM: A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion"


HEOM – Hierarchical Equations of Motion

Model for Open Quantum Systems
• understand the energy transfer in photo-active molecular complexes
  ⇒ e.g. photosynthesis, but also quantum computing
• millions of coupled ODEs
• hierarchical graph of complex matrices (auxiliary density operators, ADOs)
  ⇒ dim: N_sites × N_sites
  ⇒ count: exponential in hierarchy depth d

[Image by University of Copenhagen Biology Department]

The full equation of motion for one ADO σ_u:

$$
\frac{d\sigma_u}{dt} = -\frac{i}{\hbar}[H,\sigma_u]
- \sigma_u \sum_{b=1}^{B}\sum_{k=0}^{K-1} n_{u,(b,k)}\,\gamma(b,k)
- \sum_{b=1}^{B}\left[\frac{2\lambda_b}{\beta\hbar^2\nu_b} - \sum_{k=0}^{K-1}\frac{c(b,k)}{\hbar\gamma(b,k)}\right] V^{\times}_{s(b)} V^{\times}_{s(b)}\,\sigma_u
+ \sum_{b=1}^{B}\sum_{k=0}^{K-1} i\,V^{\times}_{s(b)}\,\sigma^{+}_{u,b,k}
+ \sum_{b=1}^{B}\sum_{k=0}^{K-1} n_{u,(b,k)}\,\theta_{MA}(b,k)\,\sigma^{-}_{(u,b,k)}
$$

Structurally, the same equation reads:

$$
\frac{d\sigma_u}{dt} =
\underbrace{-\frac{i}{\hbar}[H,\sigma_u]}_{\text{LvN commutator}}
+ \underbrace{\sum_{\text{baths}} A\,\sigma_u}_{\text{same node}}
+ \underbrace{\sum_{\text{baths}} B\,\sigma_{u^+}}_{\text{links to layer}+1}
+ \underbrace{\sum_{\text{baths}} C\,\sigma_{u^-}}_{\text{links to layer}-1}
$$

HEOM – Computational Properties

• ODE: dominated by the commutator term: [H, σ_u] = Hσ_u − σ_uH
• for each σ_u: 16 N_sites³ FLOP per 2 · 2 · 8 · N_sites² Byte
  ⇒ arithmetic intensity: N_sites/2 FLOP/Byte ⇒ compute bound for larger N_sites
• numerical integration (weighted add):
  ⇒ arithmetic intensity: 1/8 FLOP/Byte ⇒ memory bound on all relevant hardware

Device name (architecture)    | compute [TFLOPS] | memory bw. [GiB/s] | balance [FLOP/Byte]
2× Intel Xeon Gold 6138 (SKL) | 2.56             | 238                | 10.8
2× Intel Xeon E5-2680v3 (HSW) | 0.96             | 136                | 7.1
Intel Xeon Phi 7250 (KNL)     | 2.61 (a)         | 490/115 (b)        | 5.3/22.7 (b)
Nvidia Tesla K40 (Kepler)     | 1.31             | 480                | 2.7
AMD FirePro W8100 (Hawaii)    | 2.1              | 320                | 6.6
(a) Assuming 1.2 GHz AVX frequency. (b) On-chip MCDRAM / DRAM.
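As a worked check of the two intensities above (our accounting; the byte count follows the slide's 2 · 2 · 8 · N² factor, and the weighted add is taken as z = w₁x + w₂y on complex doubles with real weights):

$$
\frac{16\,N_{\text{sites}}^{3}\ \text{FLOP}}{2\cdot2\cdot8\,N_{\text{sites}}^{2}\ \text{Byte}}
= \frac{N_{\text{sites}}}{2}\ \frac{\text{FLOP}}{\text{Byte}},
\qquad
\frac{6\ \text{FLOP}}{3\cdot16\ \text{Byte}} = \frac{1}{8}\ \frac{\text{FLOP}}{\text{Byte}}
$$

For FMO (N_sites = 7) the commutator lands at 3.5 FLOP/Byte, below every balance point in the table; for PS I (N_sites = 96) it is 48 FLOP/Byte, clearly compute bound.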

HEOM – Example Systems

system name    | N_sites | baths per site | depth d | total ADO memory
FMO            | 7       | 22             | 3       | 2.8 GiB
LHC II monomer | 14      | 1              | 8       | 5.6 GiB
PS I           | 96      | 1              | 3       | 129.2 GiB
PS I           | 96      | 1              | 4       | 3231.0 GiB
(memory consumption assuming an RK4 solver)

⇒ larger systems cannot be solved on a single node:
  • memory footprint
  • time to solution
⇒ distributed-memory implementation required
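The table's 129.2 GiB entry can be reproduced with a back-of-the-envelope count (our sketch; it assumes one hierarchy index per bath term, B = N_sites · baths per site, and six full hierarchy copies held by the RK4 solver):

$$
N_{\text{ADO}} = \binom{B+d}{d} = \binom{96+3}{3} = 156\,849,
\qquad
156\,849 \times \underbrace{96^{2}\cdot16\ \text{B}}_{\text{one complex matrix}} \times \underbrace{6}_{\text{RK4 buffers}} \approx 129.2\ \text{GiB}
$$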

Interdisciplinary Workflow (domain experts ↔ computer scientists)

Mathematical Model
- ODEs
- PDEs
- graphs
- ...

↓

High-Level Prototype (Mathematica)
- domain scientist's tool
- high level
- symbolic solvers
- arbitrary precision
- very limited performance

↓

OpenCL kernel within Mathematica
- replace some code with OpenCL

OpenCL (Open Computing Language) in a Nutshell

• open, royalty-free standard for cross-platform, parallel programming
• maintained by the Khronos Group
• runs on personal computers, servers, mobile devices and embedded platforms
• first released: 2009-08-28

https://www.khronos.org/opencl/

OpenCL Platform and Memory Model

[Figure: a Host attached to Compute Devices (CD); each CD consists of Compute Units (CU), each CU of Processing Elements (PE)]

Memory model:
• the CD has device memory with global/constant address spaces
• each CU has a local memory address space
• each PE has a private memory address space
⇒ relaxed consistency

OpenCL Machine Model Mapping

OpenCL Platform      | CPU Hardware    | GPU Hardware
Compute Device       | processor/board | GPU device
Compute Unit         | core (thread)   | streaming multiprocessor
Processing Element   | SIMD lane       | CUDA core
global/const. memory | DRAM            | DRAM
local memory         | DRAM            | shared memory
private memory       | register/DRAM   | private memory/register

⇒ write code for this abstract machine model
⇒ the device-specific OpenCL compiler and runtime map it to the actual hardware
⇒ library-only implementation: no extra toolchain, many language bindings
⇒ currently the widest practical portability among parallel programming models
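Because OpenCL is a library-only API, any plain C++ toolchain suffices. A minimal host-side sketch of generic OpenCL usage (not DM-HEOM code) that discovers a device and runtime-compiles a kernel from a source string:

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    // discover the first platform and device the runtime offers
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("compute device: %s\n", name);

    // kernel source is an ordinary string, compiled at runtime by the
    // vendor's OpenCL compiler for whatever device was found above
    // (assumes the device supports the cl_khr_fp64 extension)
    const char* src =
        "kernel void scale(global double* y, const double a) {"
        "    y[get_global_id(0)] *= a;"
        "}";
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```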

Interdisciplinary Workflow (recap; domain experts ↔ computer scientists)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ App. with OpenCL kernel

newly added to "OpenCL kernel within Mathematica":
- compare results
- figure out numerics
- use accelerators in Mathematica

DM-HEOM: Design Goals and Principles

• portability
  • modern, vendor-independent standards: C++11/14/17, OpenCL, MPI-3, CMake, ...
• performance portability
  • flexibility: make performance-critical aspects configurable
  • avoid device-specific code branches (i.e. redundant code)
• scalability
  • from single nodes to supercomputers
  • a single code base for small and large problems
• powerful high-level abstractions
  • small applications on top
  • easily understood by domain scientists
• separation and exchangeability of different aspects and strategies
  • partitioning, numerical methods, memory layout, parallelisation, communication, etc.

Interdisciplinary Workflow (recap; domain experts ↔ computer scientists)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ App. with OpenCL kernel

newly added to "C++ App. with OpenCL kernel":
- start single-node
- OpenCL 1.2 for hotspots
- modern C++11/14/17
- CMake for building

Distributed Memory HEOM: Single-Node C++ Application

[Architecture diagram: HEOM Config → Hierarchy Graph → Instance → ODE ↔ Solver → Results, with an OpenCL Config feeding the Solver; Partition Mapping, Hierarchy Partition and Communicator are used later in the multi-node version]

• HEOM Config describes the physics
• the Hierarchy Graph is used to initialise a problem Instance
• the ODE (HEOM formula)
  ⇒ encapsulates the OpenCL kernel code
• OpenCL Config specifies the runtime configuration
• the Solver works on the ODE
  ⇒ encapsulates the OpenCL runtime
  ⇒ encapsulates the numerics
  ⇒ produces Results

From Portability to Performance

OpenCL:
• guarantees portability ...
• ... but not portable performance

Strategy:
1. identify the key optimisations each device requires
2. make the code configurable to the device's needs
... without writing a separate version for each device

Performance Portability: Node Level

OpenCL runtime kernel compilation:
• necessary for portability
• exploitable for performance:
  a) facilitate compiler optimisation
  b) configure code before compilation

a) compiler optimisation:
• use host-code runtime constants as kernel-code compile-time constants
• e.g. sizes, loop counts, ...
⇒ resolve index computations, eliminate branches, unroll loops, ... (see the sketch below)

b) configurable kernel code:
• work-item granularity
• memory layout

Runtime compilation for non-OpenCL codes: https://github.com/noma/kart
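A minimal sketch of technique a) (generic OpenCL usage; DIM is a hypothetical placeholder for e.g. the ADO matrix dimension): a value known only at application start is baked into the kernel as a preprocessor constant via the build options, so the device compiler can resolve indices and unroll loops.

```cpp
#include <CL/cl.h>
#include <string>

// bake a host-side runtime constant into the kernel at build time;
// inside the kernel source, DIM then behaves like a compile-time constant:
//   for (int i = 0; i < DIM; ++i) ...   // fully unrollable
cl_program build_with_constants(cl_context ctx, cl_device_id dev,
                                const char* src, int dim) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    const std::string options = "-DDIM=" + std::to_string(dim);
    clBuildProgram(prog, 1, &dev, options.c_str(), nullptr, nullptr);
    return prog;
}
```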

Performance Portability: Work-Item Granularity

• amount of work per OpenCL work-item
• processed on the smallest parallel hardware execution unit (PE)
⇒ the most important factor for efficient device utilisation

GPU devices:
• many SIMT cores
• thousands of lightweight hardware threads
• executed in groups of 32 or 64 threads
⇒ one ADO matrix element per work-item

CPU devices:
• fewer, more complex general-purpose cores
• SIMD vector units with 4 to 8 lanes
⇒ one ADO matrix per processing element

⇒ requires a compile-time configurable outer loop nest inside the kernel (sketched below)
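A sketch of such a configurable loop nest (illustrative only: the kernel name, the process_element helper and the WI_GRANULARITY_MATRIX/DIM macros are our placeholders, not DM-HEOM's actual kernel; the macros would be set via the build options shown above):

```cpp
// OpenCL C kernel source embedded as a C++ raw string; process_element
// stands for the per-element HEOM update and must be supplied elsewhere
const char* heom_kernel_src = R"(
kernel void heom_ode_step(global const double2* sigma_in,
                          global double2* sigma_out)
{
#ifdef WI_GRANULARITY_MATRIX
    // CPU configuration: one ADO matrix per work-item; work-items map to
    // SIMD lanes, so this loop nest is vectorised across whole matrices
    const size_t m = get_global_id(0);
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            process_element(sigma_in, sigma_out, m, i, j);
#else
    // GPU configuration: one matrix element per work-item, executed by
    // thousands of light-weight threads in groups of 32/64
    const size_t id = get_global_id(0);
    process_element(sigma_in, sigma_out,
                    id / (DIM * DIM), (id / DIM) % DIM, id % DIM);
#endif
}
)";
```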

Performance Portability: Memory Layout

⇒ match the requirements of the device-specific memory architecture
• GPU: coalesced accesses from work-groups, without bank conflicts
• CPU: contiguous SIMD vector load/store instructions (avoid gather/scatter ops)
⇒ parallelisation strategy, granularity and memory layout must match
• e.g. outer-loop vectorisation for SIMD architectures

AoS vs. SIMD-friendly AoSoA Memory Layout

AoS — one matrix after another (A's 1st row, A's 2nd row, ..., then B, C, ..., H):

  a11 a12 a13 ... a1n a21 ... ann | b11 b12 ... bnn | c11 ... cnn | ... | h11 ... hnn

⇒ with work-item1 ... work-item8 mapped to 8 SIMD lanes (one matrix each), neighbouring lanes access addresses a whole matrix apart
⇒ costly gather/scatter access

AoSoA — 8 matrices = 1 SoA "package" (8 elements × 8 rows interleaved):

  a11 b11 c11 ... h11 | a12 b12 c12 ... h12 | ... | a1n b1n c1n ... h1n | a21 b21 ... h21 | ... | ann bnn ... hnn

⇒ the same element of all 8 matrices is contiguous
⇒ optimal: contiguous vector load/store
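The two layouts differ only in their index function. A self-contained sketch (our illustration; N and W are example values, not DM-HEOM's configuration mechanism):

```cpp
#include <cstddef>

constexpr std::size_t N = 96; // matrix dimension (e.g. PS I)
constexpr std::size_t W = 8;  // SIMD width = matrices per AoSoA package

// AoS: element (i,j) of matrix m; adjacent matrices are N*N elements
// apart, so vectorising across matrices needs gather/scatter
constexpr std::size_t idx_aos(std::size_t m, std::size_t i, std::size_t j) {
    return m * N * N + i * N + j;
}

// AoSoA: element (i,j) of the W matrices of a package is contiguous,
// so vectorising across matrices becomes one contiguous load/store
constexpr std::size_t idx_aosoa(std::size_t m, std::size_t i, std::size_t j) {
    const std::size_t package = m / W, lane = m % W;
    return (package * N * N + i * N + j) * W + lane;
}
```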

Interdisciplinary Workflow (recap; domain experts ↔ computer scientists)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ App. with OpenCL kernel → Distributed Application (MPI)

newly added stage, "Distributed Application (MPI)":
- scale to multiple nodes
- partitioning, neighbour exchange, ...
- wrapped MPI 3.0

Partitioning the Hierarchy

• problem: map the hierarchy graph nodes to n partitions (compute nodes)
  • minimise communication
  • minimise load imbalance
⇒ graph partitioning (GP) is NP-hard
⇒ the hierarchy graph is highly connected

Partitioning the Hierarchy (PS I with d = 3)

• METIS-generated partitionings
• partitioning quality mostly resilient to METIS settings

           | neighbor partitions | hierarchy nodes per partition (avg)
partitions | avg   min   max     | nodes   shared %   halo    overhead %
16         | 15    15    15      | 9803    91.7%      12508   127.6%
32         | 31    31    31      | 4902    95.0%      7858    160.3%
64         | 62    60    63      | 2451    96.8%      4739    193.2%
128        | 127   125   127     | 1225    98.0%      2958    241.4%
256        | 241   212   255     | 613     98.8%      1689    274.8%
512        | 281   124   414     | 306     99.0%      929     303.3%

⇒ highly connected partitioning graph ⇒ almost all-to-all communication
⇒ almost all nodes required by other partitions ⇒ prevents inner/outer communication/computation overlap
⇒ more halo nodes than local nodes
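For reference, a sketch of how such a partitioning is obtained from METIS (generic METIS 5 usage on a CSR graph; DM-HEOM's actual invocation and options may differ):

```cpp
#include <metis.h>
#include <vector>

// part[v] = partition (compute node) assigned to hierarchy node v
std::vector<idx_t> partition_hierarchy(std::vector<idx_t>& xadj,
                                       std::vector<idx_t>& adjncy,
                                       idx_t nparts) {
    idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1; // CSR: |xadj| = n+1
    idx_t ncon = 1;   // one balance constraint: node count per partition
    idx_t edgecut;    // returns the achieved edge-cut (communication proxy)
    std::vector<idx_t> part(nvtxs);
    // minimise edge-cut (communication) under balanced partition sizes
    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        nullptr, nullptr, nullptr, // unit vertex/edge weights
                        &nparts, nullptr, nullptr, nullptr,
                        &edgecut, part.data());
    return part;
}
```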

Distributed Memory HEOM: Multi-Node C++ Application

[Architecture diagram as before, now with Partition Mapping, Hierarchy Partition and Communicator in use]

• simple transition to a multi-node application
• Partition Mapping as an additional input (METIS)
• the generated Hierarchy Partition serves as the new input for the Instance
  ⇒ Instance and Solver do not care
• the Communicator encapsulates MPI-3:
  • neighborhood collectives
  • derived data types
  ⇒ fully declarative communication API (see the sketch below)
• the ODE can trigger an action ⇒ neighbor exchange prior to evaluation
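A minimal sketch of the two MPI-3 features named above (generic MPI usage, not DM-HEOM's Communicator API): a distributed graph topology over the neighbor partitions, plus one neighborhood collective that performs the whole halo exchange declaratively.

```cpp
#include <mpi.h>
#include <vector>

// exchange equally-sized halo blocks with all neighbor partitions
void neighbor_exchange(MPI_Comm comm, const std::vector<int>& neighbors,
                       const std::vector<double>& send_halo,
                       std::vector<double>& recv_halo, int block_size) {
    const int degree = static_cast<int>(neighbors.size());

    // declare who this rank talks to; the MPI library may use this
    // knowledge to optimise placement and message scheduling
    MPI_Comm graph_comm;
    MPI_Dist_graph_create_adjacent(comm,
                                   degree, neighbors.data(), MPI_UNWEIGHTED,
                                   degree, neighbors.data(), MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph_comm);

    // one call replaces all hand-written send/recv loops
    MPI_Neighbor_alltoall(send_halo.data(), block_size, MPI_DOUBLE,
                          recv_halo.data(), block_size, MPI_DOUBLE,
                          graph_comm);
    MPI_Comm_free(&graph_comm);
}
```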

Interdisciplinary Workflow (recap; domain experts ↔ computer scientists)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ App. with OpenCL kernel → Distributed Application (MPI) → Optimisation / Production Runs

newly added stage, "Optimisation / Production Runs":
- always collect performance data
- profile/tune the code
- explore new architectures

Benchmarks: Work-Item Granularity

[Bar chart "Impact of Work-item Granularity": average solver step runtime [ms] (0–6000) for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg on 2 SKL, 2 HSW, KNL, K40 and W8100, comparing Matrix vs. Element granularity]

⇒ CPUs: 1.2× to 1.35× speedup with Matrix granularity
⇒ GPUs: up to 6.7× (K40) and 7.2× (W8100) speedup with Element granularity

Benchmarks: Memory Layout

[Bar chart "Impact of Configurable Memory Layout": average solver step runtime [ms] (0–1500) for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg on 2 SKL, 2 HSW and KNL, comparing AoS vs. AoSoA]

⇒ SKL and HSW: 1.3× to 2.4× speedup with AoSoA
⇒ KNL: 1.6× to 2.8× speedup with AoSoA

Benchmarks: Performance Portability

[Bar chart "Performance Portability Relative to Xeon (SKL)": average solver step runtime [ms] for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg on SKL, HSW, KNL, K40 and W8100; measured runtimes next to runtimes expected from peak FLOPS]

⇒ the 2× Xeon (SKL) system is the reference
⇒ gray bars are expected runtimes, extrapolated from peak FLOPS
⇒ the older Haswell Xeon exceeds expectations, due to better OpenCL support
⇒ good result: within 30% of expectation
⇒ KNL and K40 are sensitive to the irregular accesses caused by the extreme coupling in this scenario

Benchmarks: Scalability

[Plots "Strong Scaling of PS I with 3 Layers": runtime [ms] (0–2500) and speedup vs. 16 nodes, on 16 to 512 nodes, broken down into communication, compute and their sum]

⇒ communication cost > computation cost ⇒ better node utilisation means worse scalability
⇒ compute scales ideally
⇒ the all-to-all-like neighbour exchange limits scalability
⇒ 14.1× speedup from 16 to 512 nodes

Summary

Lessons learned:
• an interdisciplinary workflow is key for developing HPC codes
• standards (OpenCL, MPI-3, ...) enable portability
• a flexible design enables portable performance
  ⇒ leverage runtime compilation
  ⇒ configurable work-item granularity and memory layout

DM-HEOM:
• first distributed-memory HEOM implementation
• pushes the boundary of feasible problem sizes
• practical scalability from laptops to supercomputers

Thank you. Feedback? Questions? Ideas?

[email protected]

This work was funded by the Deutsche Forschungsgemeinschaft (DFG), projects RE 1389/8-1 and KR 2889/7-1. Additional funding came from Intel Corp. within the activities of the "Research Center for Many-Core HPC" at ZIB, an Intel Parallel Computing Center. The authors acknowledge the North-German Supercomputing Alliance (HLRN) for providing computing time on the Cray XC40.