25
Acknowledgments Autotuning and Adaptivity Approach for Energy Efficient HPC Systems: The ANTAREX Toolbox Cristina Silvano, Politecnico di Milano http://www.antarex-project.eu/

Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Acknowledgments

Autotuning and Adaptivity Approach for Energy Efficient HPC Systems: The ANTAREX Toolbox

Cristina Silvano, Politecnico di Milano

http://www.antarex-project.eu/

Page 2: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Energy-efficient exascalesupercomputers • DARPA’s target of 20MW• 4x energy efficiency from

around 13 to 50 GFLOPS/W• Summit (June 2018) 1st-Top500

5th-Green500 122 PetaFLOPS13.8 GFLOPS/W

ANTAREX: The context

Programmability and easy application tuning• Develop HPC applications is a

complex tasks, which requires• several specialized languages• tools for performance tuning

• Incompatible with the current trend to open HPC infrastructures to a wide range of users Next generation supercomputers need to be 

coupled with a radically new software stackcapable of exploiting the benefits offered by 

heterogeneity to meet programmability, scalability and energy efficiency required by the Exascale era. 

Page 3: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Energy-efficient exascalesupercomputers • DARPA’s target of 20MW• 8x energy efficiency from

around 6 to 50 GFlops/W• Sunway TaihuLight (Nov. 2017)

1st-Top500 20th-Green500

ANTAREX: The context

Programmability and easy application tuning• Develop HPC applications is a

complex tasks, which requires• several specialized languages• tools for performance tuning

• Incompatible with the current trend to open HPC infrastructures to a wider range of users Next generation supercomputers need to be 

coupled with a radically new software stackcapable of exploiting the benefits offered by 

heterogeneity to meet programmability, scalability and energy efficiency required by the Exascale era. 

ANTAREX wants to provide a breakthrough approach • to express by a Domain Specific Language the

application self‐adaptivity and extra‐functional requirements

• to runtime manage, monitor and autotune applicationsfor green HPC systems up to the Exascale level 

ANTAREX wants to provide a breakthrough approach • to express by a Domain Specific Language the

application self‐adaptivity and extra‐functional requirements

• to runtime manage, monitor and autotune applicationsfor green HPC systems up to the Exascale level 

Page 4: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

ANTAREX Tool flowC/C++ w/ OpenMP, MPI, OpenCL

functional description

Page 5: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

ANTAREX Tool flowC/C++ w/ OpenMP, MPI, OpenCL

functional description

mARGOt

LARA/Clava

ExaMon

libVersioningCompiler

Page 6: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

LARA DSL for Extra-Functional Specifications

DSL WEAVER

Functional Description Extra‐Functional Requirements E.g.  Adaptivity features, Tuning Knobs,

Power/Performance constraints

DSL LARAC/C++ w/ OpenMP, 

MPI, OpenCL

• Enable separation of concerns: functional and extra-functional descriptions are decoupled.

• Useful to express strategies forinstrumentation and profiling

• Support for explore compileroptimization space, accordingto code and target architectures

• Enable more advanced controlthan using pragmas/directives/ switches (e.g. considering application parameters and mapping control)

• Enable separation of concerns: functional and extra-functional descriptions are decoupled.

• Useful to express strategies forinstrumentation and profiling

• Support for explore compileroptimization space, accordingto code and target architectures

• Enable more advanced controlthan using pragmas/directives/ switches (e.g. considering application parameters and mapping control)

Page 7: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

LARA Design Benefits from Aspect Orientation Reusable Strategies

Custom Targetability

Design Exploration

Page 8: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

LARA-based Tool Flow

Application (C, Java, MATLAB)

Aspects / Strategies (LARA)

Library of Aspects / Strategies

Compiler Toolset

Code Output Analysis Output

Page 9: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

LARA Instrumentation Example

Page 10: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

LARA Instrumentation Example

Used to manage precision tuning, application and compiler autotuning, code transformations, code 

versioning, memoization, application monitoring, etc.

Used to manage precision tuning, application and compiler autotuning, code transformations, code 

versioning, memoization, application monitoring, etc.

Page 11: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Loop Unrolling: Aspects and Strategies

LARA: “recipes” for compiler optimizations

aspectdef LoopUnroll

select loop end

apply

if($loop.num_iterations <= 32) {

$loop.exec Unroll(0);

} else {

$loop.exec Unroll(2);

}

end

condition

$loop.is_innermost &&

$loop.type=="for"

end

end

Advices (actions)

Program elements 

Condition

Page 12: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Apply dynamic choices from algorithm parameters to compiler and mapping optimizationsForms of runtime adaptivity include:

• Modifications of application parameters (attributes)• Selection among different algorithms for solving the same problem• Different compiler optimizations for the same algorithm• Runtime strategies for partitioning and for mapping computations

targeting hardware accelerators• Runtime management of system resources

LARA: Runtime Adaptivity Dimensions

Extending LARA with native support for runtime adaptivity strategies

Page 13: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

ANTAREX Runtime Framework

CoolingNode1 Node2 NodeNNodei

Rack1 RackK

J1J2

J3JN

Job SchedulerAllocator

J1 JN

M M M M

AM

P P P P

COff

C

AA AM

COn

A

C

P

COff

COn

MAM

Autotuner

Collector

Power Manager

Offline Compiler

Online Compiler

Hardware Monitor

Application Monitor

Compile Time

Deploy Time

Run Time

COn

M

C DB

Page 14: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

mARGOt: ANTAREX Application AutotunermARGOt enables self-optimization at application-level by:

– SW Knobs tunable at runtime (such as parallelism and code versioning)

– Application knobs to tune the approximation by trading off accuracy vs throughput: output just needs to be “good enough”

– Feed-back loop to make the application behaviour self-aware

Page 15: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

ExaMon: Monitoring System & Power Management

DB

CassandraKairosDBCassandraKairosDB

Page 16: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

ANTAREX Power Management

Power capping strategies to select the best performance point for each core given a power constraint.

Page 17: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

AutoTuning Geometric Docking for HPC-based Drug Discovery

Page 18: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Personalized Medicine will enable to “treat the right patient with the right drug at the right dose at the right time." [FDA]

Use Case 1: HPC Accelerated Drug Discovery System

Need of HPC in Drug Discovery: HPC Molecular Simulations

Page 19: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Molecular Docking is a method to calculate the preferred position and shape of one molecule to a second when bound to each other• Geometric Docking

• Shape Complementarity: geometric matching search to find out compatible pairs and most suitable poses

• Pharmacophoric Docking• Molecular Simulation: exploration of a large energy

landscape determined by chemical and physical interactions

Molecular Docking

Page 20: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Developing energy and resource efficient algorithmsUsing self‐functionalities to adapt and scale‐out LiGen application

LiGen: Exascale‐ready HPC application Tangible Chemical Space: 300 Bio

Peptide library: 120 Mio

Commercial libraries: 6 Mio

SafeInMan library: 9 K

Exascale HPC Virtual ScreeningExascale HPC Virtual Screening

LiGen in ANTAREX

Large experimental campaign1Bio ligands database currently running on 250K CPUs MARCONI supercomputer at CINECA

Page 21: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

AutoTuning Time-Dependent Routing Algorithm in HPC based

Navigation System

Page 22: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Smart City Challenge exploiting synergies betweenclient‐side and server‐side: • Many drivers – many routing requests to HPC system• Multiple traffic status data sources• Continuous update of traffic flow calculation

Use Case 2: Self-adaptive Navigation System

10

HPCSygic Company develops world`s most popular navigation application & provides professional

navigation software for business solutions

Page 23: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

Smart City Challenges • Navigate hundred thousands of drivers/cars

within city area• Avoid traffic jams• Intelligent routing – Balance routes for a city

global optimum • Accurate calculation of traffic flow – by

exploiting big data• Minimize data transfer• To serve all drivers’ requests with global best

under variable conditions• Large scale graph problems (e.g.

Betweenness centrality) with data updates• Around 12M vertex and 30M edges for ¼ Europe

Intelligent Navigation for Smart Cities

Page 24: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

What is the time dependent routing part of a navigation system?• It is the module responsible for determining the expected travel time• Given the client-server navigation infrastructure, it is on the server side to

have accurate travel time evaluations with updated traffic information• It is implemented by using MonteCarlo Simulation (MCS) over the

probabilistic speed profile on each path segment

Probabilistic Time Dependent Routing

Travel‐Time Probability Distribution

#Samples

MCSMCSTime + Path + Speed Profiles

Large experimental campaign currently running on IT4I supercomputer

Page 25: Autotuningand AdaptivityApproach for Energy Efficient HPC ...eethpc.net/wp-content/uploads/2018/06/ANTAREX_ISC18-FINAL.pdf · • DARPA’s target of 20MW • 8xenergy efficiency

http://www.antarex-project.eu/