
E.T. International, Inc.

X-Stack: Programming Challenges, Runtime Systems, and Tools

Brandywine Team, May 2013

DynAX: Innovations in Programming Models, Compilers and Runtime Systems for

Dynamic Adaptive Event-Driven Execution Models


Objectives

• Scalability: Expose, express, and exploit O(10^10) concurrency
• Locality: Locality-aware data types, algorithms, and optimizations
• Programmability: Easy expression of asynchrony, concurrency, locality
• Portability: Stack portability across heterogeneous architectures
• Energy Efficiency: Maximize static and dynamic energy savings while managing the tradeoff between energy efficiency, resilience, and performance
• Resilience: Gradual degradation in the face of many faults
• Interoperability: Leverage legacy code through a gradual transformation towards exascale performance
• Applications Support: NWChem


Brandywine X-Stack Software Stack

• SWARM (Runtime System)
• SCALE (Compiler)
• HTA (Library)
• R-Stream (Compiler)
• NWChem + Co-Design Applications
• Rescinded Primitive Data Types


SWARM vs. MPI, OpenMP, OpenCL

• MPI, OpenMP, OpenCL: Communicating Sequential Processes; Bulk Synchronous Message Passing
• SWARM: Asynchronous Event-Driven Tasks; Dependencies; Resources; Active Messages; Control Migration

[Figure: execution timelines contrasting time spent in active threads vs. waiting]


SWARM: Principles of Operation

• Codelets (see the sketch below)
  * Basic unit of parallelism
  * Nonblocking tasks
  * Scheduled upon satisfaction of precedent constraints
• Hierarchical Locale Tree: spatial position, data locality
• Lightweight Synchronization
• Active Global Address Space (planned)
• Dynamics
  * Asynchronous Split-phase Transactions: latency hiding
  * Message-Driven Computation
  * Control-flow and Dataflow Futures
  * Error Handling
  * Fault Tolerance (planned)
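A minimal sketch of the codelet idea described above, assuming a generic dependence-counting scheduler rather than the actual SWARM API (codelet_t, satisfy, and schedule are illustrative names):

    /* Sketch only, not the SWARM API: a nonblocking task whose body runs
     * once every precedent constraint has been satisfied. */
    #include <stdio.h>
    #include <stdatomic.h>

    typedef struct codelet {
        void (*fn)(void *);   /* nonblocking body: runs to completion, never waits */
        void *arg;
        atomic_int deps;      /* outstanding precedent constraints */
    } codelet_t;

    /* Stand-in for the runtime's ready queue: here we simply run the codelet. */
    static void schedule(codelet_t *c) { c->fn(c->arg); }

    /* Called by each predecessor as it finishes; the last one fires the codelet. */
    static void satisfy(codelet_t *c)
    {
        if (atomic_fetch_sub(&c->deps, 1) == 1)
            schedule(c);
    }

    static void body(void *arg) { printf("codelet ran: %s\n", (char *)arg); }

    int main(void)
    {
        char msg[] = "all inputs ready";
        codelet_t c = { body, msg, 2 };
        satisfy(&c);   /* first dependency satisfied: codelet stays pending */
        satisfy(&c);   /* second dependency satisfied: codelet fires */
        return 0;
    }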


Cholesky DAG

• POTRF → TRSM
• TRSM → GEMM, SYRK
• SYRK → POTRF

[Figure: tile-level task DAG for iterations 1-3, showing POTRF, TRSM, SYRK, and GEMM tasks]

• Implementations: OpenMP, SWARM (a task-based sketch follows below)
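For reference, a sketch of this tile-level DAG expressed with OpenMP task dependencies; it assumes tiles stored as contiguous row-major b*b blocks and a CBLAS/LAPACKE installation, and is illustrative rather than the tuned implementation benchmarked on the following slides:

    /* Tiled Cholesky (lower triangle): the POTRF -> TRSM -> {SYRK, GEMM}
     * dependences of the DAG become OpenMP task depend clauses.
     * A[i][j] points to the b*b row-major tile in block row i, column j (i >= j);
     * the first element of each tile stands in for the whole tile in the
     * depend clauses. */
    #include <cblas.h>
    #include <lapacke.h>

    void tiled_cholesky(int nt, int b, double *A[nt][nt])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nt; k++) {
            #pragma omp task depend(inout: A[k][k][0])
            LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', b, A[k][k], b);           /* POTRF */

            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
                cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower,          /* TRSM */
                            CblasTrans, CblasNonUnit, b, b,
                            1.0, A[k][k], b, A[i][k], b);
            }
            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
                cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans,        /* SYRK */
                            b, b, -1.0, A[i][k], b, 1.0, A[i][i], b);

                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
                    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,    /* GEMM */
                                b, b, b, -1.0, A[i][k], b, A[j][k], b,
                                1.0, A[i][j], b);
                }
            }
        }
    }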


Cholesky Decomposition: Xeon

[Chart: speedup over serial vs. # threads (1-12) for naïve OpenMP, tuned OpenMP, and SWARM]


Cholesky Decomposition: Xeon Phi

[Chart: OpenMP vs. SWARM performance on Xeon Phi with 240 threads]

OpenMP fork-join programming suffers on many-core chips (e.g., Xeon Phi); SWARM removes these fork-join synchronizations.


Cholesky: SWARM vs. ScaLAPACK/MKL

16-node cluster: Intel Xeon E5-2670, 16 cores, 2.6 GHz

[Chart: GFLOPS vs. # nodes for ScaLAPACK/MKL and SWARM]

Asynchrony is key in large dense linear algebra.


Code Transition to Exascale

1. Determine application execution, communication, and data access patterns.
2. Find ways to accelerate application execution directly.
3. Consider data access patterns to better lay out data across distributed heterogeneous nodes.
4. Convert single-node synchronization to asynchronous control-flow/data-flow (OpenMP -> asynchronous scheduling).
5. Remove bulk-synchronous communications where possible (MPI -> asynchronous communication; see the sketch after this list).
6. Synergize inter-node and intra-node code.
7. Determine further optimizations afforded by the asynchronous model.

This method was successfully deployed for the NWChem code transition.
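As one concrete instance of step 5, here is a minimal sketch of replacing a bulk-synchronous halo exchange with non-blocking MPI calls so that independent computation can overlap communication. The function and buffer names are illustrative, and SWARM goes further than this, driving computation with active messages instead of matched sends and receives:

    /* Step 5 sketch: post non-blocking receives and sends up front instead of
     * pairing blocking MPI_Send/MPI_Recv around a barrier, then overlap the
     * interior computation with the communication. */
    #include <mpi.h>

    void exchange_halo(double *send_lo, double *recv_lo,
                       double *send_hi, double *recv_hi,
                       int n, int lo_rank, int hi_rank, MPI_Comm comm)
    {
        MPI_Request req[4];

        MPI_Irecv(recv_lo, n, MPI_DOUBLE, lo_rank, 0, comm, &req[0]);
        MPI_Irecv(recv_hi, n, MPI_DOUBLE, hi_rank, 1, comm, &req[1]);
        MPI_Isend(send_lo, n, MPI_DOUBLE, lo_rank, 1, comm, &req[2]);
        MPI_Isend(send_hi, n, MPI_DOUBLE, hi_rank, 0, comm, &req[3]);

        /* ... update interior points that do not need the halo here ... */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   /* then finish the boundary */
    }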


Self-Consistent Field Module from NWChem

• NWChem is used by thousands of researchers.
• The code is designed to be highly scalable, up to the petaflop scale.
• Thousands of person-hours have been expended on tuning and performance.
• The Self-Consistent Field (SCF) module is a key component of NWChem.
• As part of the DOE X-Stack program, ETI has worked with PNNL to extract the algorithm from NWChem and study how to improve it.


Serial Optimizations

[Chart: speedup of successive serial optimizations over the original code: Symmetry of g(), BLAS/LAPACK, Precompute g values, Fock Matrix Symmetry]
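The BLAS/LAPACK bar above reflects a standard transformation: replacing hand-written loop nests with tuned library kernels. A generic illustration with stand-in matrices, not the actual SCF code:

    /* Hand-written triple loop vs. a single tuned dgemm call for C = A * B
     * (square n-by-n, row-major).  The SCF module applies the same idea to
     * its dense matrix products. */
    #include <cblas.h>

    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
    }

    void matmul_blas(int n, const double *A, const double *B, double *C)
    {
        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }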


Single-Node Parallelization

[Chart: speedup of OpenMP versions (Dynamic v1-v3, Guided v1-v3, Static v1-v3) and SWARM vs. # threads (1-31), against the ideal scaling line]
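The Dynamic, Guided, and Static series in this chart differ in the OpenMP loop schedule used for the irregular SCF work. A generic sketch of that knob, with a hypothetical per-block kernel standing in for the SCF code:

    /* The schedule clause is the knob being compared: block costs are very
     * uneven, so static partitioning load-imbalances while dynamic and guided
     * scheduling adapt (and SWARM's codelet scheduling adapts further). */
    #include <omp.h>

    double expensive_block(int b);   /* hypothetical per-block kernel */

    void compute_blocks(double *out, int nblocks)
    {
        /* schedule(static):  equal chunks assigned up front
         * schedule(dynamic): one block at a time, handed out as threads free up
         * schedule(guided):  large chunks first, shrinking toward the end */
        #pragma omp parallel for schedule(dynamic)
        for (int b = 0; b < nblocks; b++)
            out[b] = expensive_block(b);
    }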


Multi-Node Parallelization

[Chart: SCF multi-node execution time in seconds (log scale) vs. # cores (16-2048) for SWARM and MPI]

[Chart: SCF multi-node speedup over a single node vs. # cores (16-2048) for SWARM and MPI]


Information Repository

• All of this information is available in more detail at the X-Stack wiki: http://www.xstackwiki.com


Questions?


Acknowledgements

• Co-PIs: Benoit Meister (Reservoir), David Padua (Univ. Illinois), John Feo (PNNL)
• Other team members:
  * ETI: Mark Glines, Kelly Livingston, Adam Markey
  * Reservoir: Rich Lethin
  * Univ. Illinois: Adam Smith
  * PNNL: Andres Marquez
• DOE: Sonia Sachs, Bill Harrod