Revisiting Co-Scheduling for Upcoming ExaScale Systems
Plenary talk at HPCS 2015, Stefan Lankes


Page 1: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Revisiting Co-Scheduling for Upcoming ExaScale Systems
Plenary talk at HPCS 2015

Stefan Lankes

Page 2: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Agenda

Motivation

Migration in HPC Environments
- Process-level
- Virtualization
- Container-based

Evaluation of Migration Techniques
- Overhead
- Migration Time

Co-Scheduling

Conclusion


Page 3: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Motivation

Page 4: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications


Page 5: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Funded by the Federal Ministry of Education and Research, Germany, 2014–2016
Project partners:

- Johannes Gutenberg-Universität Mainz, Zentrum für Datenverarbeitung (JGU), coordinator
- Technische Universität München, Lehrstuhl für Rechnertechnik und Rechnerorganisation (TUM)
- Universität zu Köln, Regionales Rechenzentrum (RRZK)
- Fraunhofer-Institut für Algorithmen und wissenschaftliches Rechnen (SCAI)
- RWTH Aachen University, Institute for Automation of Complex Power Systems (ACS), E.ON Energy Research Center
- MEGWARE Computer Vertrieb und Service GmbH
- ParTec Cluster Competence Center GmbH


Page 6: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Exascale Systems

Characteristics/Assumptions:
- Increasing number of cores per node
- Increase in CPU performance will not be matched by I/O performance

→ Most applications cannot exploit this parallelism

Consequences:
- Exclusive node assignment has to be revoked (Co-Scheduling)
- Dynamic scheduling during runtime will be necessary

→ Migration of processes between nodes indispensable


Page 7: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

A twofold scheduling approach:

1. Initial placement of jobs by a global scheduler according to KPIs:
   - Load of the individual nodes
   - Power consumption
   - Applications' resource requirements

2. Dynamic runtime adjustments:
   - Migration of processes
   - Tight coupling with the applications
   - Feedback to the global scheduler

[Figure: a central scheduler coordinates per-node agents; each node's agent manages local processes of App 1 and App 2]
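The slides do not show code for the agent/scheduler interaction; purely as an illustrative sketch, a per-node agent could periodically report KPIs such as node load, power draw, and the applications' resource demands to the central scheduler. All names and the report layout below are hypothetical assumptions, not part of the FAST project code.

```c
#include <stdio.h>

/* Hypothetical KPI report a per-node agent might send to the
 * central scheduler (illustrative only, not FAST project code). */
struct kpi_report {
    int    node_id;
    double load;            /* e.g., run-queue length or CPU utilization */
    double power_watts;     /* current node power draw */
    double mem_bw_demand;   /* aggregated memory-bandwidth demand of local apps */
};

/* Placeholder for the transport to the central scheduler (e.g., a socket). */
static void send_to_scheduler(const struct kpi_report *r)
{
    printf("node %d: load=%.2f power=%.1f W mem-bw=%.2f\n",
           r->node_id, r->load, r->power_watts, r->mem_bw_demand);
}

int main(void)
{
    /* In a real agent this would be sampled periodically and fed back,
     * so the scheduler can trigger process migrations at runtime. */
    struct kpi_report r = { .node_id = 1, .load = 0.75,
                            .power_watts = 180.0, .mem_bw_demand = 0.6 };
    send_to_scheduler(&r);
    return 0;
}
```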


Page 8: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Project partner – RWTH Aachen University

Main task: process migration for load balancing
Close interaction between:
- Migration
- Scheduling
- Application

Applications know their performance characteristics and resource requirements


Page 9: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Migration in HPC Environments

Page 10: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Process-level Migration

[Figure: process-level migration; processes run on the OS of Host A and Host B, connected via an IB fabric, and a single process is moved from Host A to Host B]


Page 11: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Process-level Migration

Move a process from one node to another, including its execution context (i.e., register state and physical memory)

A special kind of Checkpoint/Restart (C/R) operation
Several frameworks available:
- Condor's checkpoint library
- libckpt
- Berkeley Lab Checkpoint/Restart (BLCR)
- Distributed MultiThreaded Checkpointing (DMTCP)


Page 12: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Berkeley Lab Checkpoint/Restart

Specifically designed for HPC applications
Two components:
- Kernel module performing the C/R operation
- Shared library enabling user-space access

Cooperation with the C/R procedure via a callback interface:
- Close/open file descriptors
- Tear down/reestablish communication channels
- ...

→ Residual dependencies that have to be resolved!
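As an illustration of this callback-based cooperation, the sketch below follows BLCR's libcr interface (cr_init, cr_register_callback, cr_checkpoint); treat the exact signatures and flag values as assumptions to be checked against the BLCR headers.

```c
#include <libcr.h>   /* BLCR's user-space library (assumed installed) */
#include <stdio.h>

/* Callback invoked around a checkpoint: this is where residual
 * dependencies (open files, communication channels) are torn down
 * before the dump and reestablished afterwards. */
static int checkpoint_cb(void *arg)
{
    /* ... close file descriptors, drain/tear down channels ... */

    int rc = cr_checkpoint(0);          /* hand control to BLCR */

    if (rc > 0) {
        /* restarted (possibly on another node): reopen files, reconnect */
    } else if (rc == 0) {
        /* continuing in the original process after the checkpoint */
    } else {
        fprintf(stderr, "checkpoint failed\n");
    }
    return 0;
}

int main(void)
{
    cr_init();                                          /* initialize libcr */
    cr_register_callback(checkpoint_cb, NULL, CR_THREAD_CONTEXT);

    /* ... normal application work; a checkpoint can now be requested
     * externally, e.g. with BLCR's cr_checkpoint utility ... */
    return 0;
}
```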


Page 13: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Virtual Machine Migration

[Figure: VM migration; Host A and Host B each run VMs (each hosting a process) on top of a hypervisor, connected via an IB fabric, and a whole VM is moved from Host A to Host B]


Page 14: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Virtual Machine Migration

Migration of a virtualized execution environment
Reduction of the residual dependencies:
- File descriptors still valid after the migration
- Communication channels are automatically restored (e.g., TCP)

Performance degradation of I/O devices can be compensated by:
- Pass-through (e.g., Intel VT-d)
- Single Root I/O Virtualization (SR-IOV) (see the host-side sketch after this list)

Various hypervisors available:
- Xen
- Kernel-based Virtual Machine (KVM)
- ...
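As a rough illustration of how VFs for SR-IOV are provisioned on the host, the sketch below writes the desired VF count to the device's sriov_numvfs attribute in sysfs; the PCI address used here is a hypothetical example, and whether this interface (rather than driver module parameters) applies depends on the NIC/HCA driver.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical PCI address of the physical function (PF);
     * replace with the real device present on the host. */
    const char *path =
        "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("open sriov_numvfs");
        return 1;
    }

    /* Request four virtual functions (VFs); guests can then be given
     * one VF each via PCI pass-through. */
    fprintf(f, "4\n");
    fclose(f);
    return 0;
}
```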


Page 15: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Kernel Based Virtual Machine

A Linux kernel module that benefits from existing Linux resources:
- Scheduler
- Memory management
- ...

Implements full virtualization requiring hardware support
VMs are scheduled by the host like any other process
Already provides a migration framework (see the sketch below):
- Live migration
- Cold migration
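The slides do not prescribe a particular management interface; as one possible sketch, a KVM guest can be live-migrated via the libvirt C API as shown below. The host names and the domain name are placeholders, and error handling is kept minimal.

```c
#include <libvirt/libvirt.h>   /* link with -lvirt */
#include <stdio.h>

int main(void)
{
    /* Connections to the source and destination hypervisors
     * (host names are placeholders). */
    virConnectPtr src = virConnectOpen("qemu:///system");
    virConnectPtr dst = virConnectOpen("qemu+ssh://hostB/system");
    if (!src || !dst) {
        fprintf(stderr, "failed to connect to hypervisor(s)\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(src, "hpc-vm1");
    if (!dom) {
        fprintf(stderr, "domain not found\n");
        return 1;
    }

    /* Live migration: the guest keeps running while its memory is
     * transferred; only a short downtime occurs at the end. */
    virDomainPtr migrated =
        virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0);
    if (!migrated)
        fprintf(stderr, "migration failed\n");

    virDomainFree(dom);
    if (migrated)
        virDomainFree(migrated);
    virConnectClose(src);
    virConnectClose(dst);
    return 0;
}
```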


Page 16: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Software-based Sharing

[Figure: two VMs, each with an eth0 interface, attached to a software virtual switch in the hypervisor that multiplexes the physical NIC]


Page 17: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Pass-Through

[Figure: the VMs' eth0 interfaces access the physical NIC directly via pass-through, bypassing the hypervisor's software switch]


Page 18: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Single Root I/O Virtualization

[Figure: the physical NIC exposes a Physical Function (PF) and multiple Virtual Functions (VFs); each VM's eth0 is bound directly to its own VF, bypassing the hypervisor]


Page 19: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Container-based Virtualization

[Figure: (a) Virtual machines: hardware, host kernel, hypervisor, one guest kernel per VM, apps on top; (b) Containers: hardware, shared kernel, virtualization layer, apps on top]



Page 21: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Container-based Migration

Full virtualization results in multiple kernels on one node
Idea of container-based virtualization:
→ Reuse the existing host kernel to manage multiple user-space instances

Host and guest have to use the same operating system
Common representatives:
- OpenVZ
- Linux Containers (LXC) (see the checkpoint/restore sketch below)
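As a hedged sketch of the container-migration building blocks, LXC's C API exposes checkpoint/restore hooks (backed by CRIU); a migration would checkpoint on the source node, transfer the image directory, and restore on the target. The container name, the directory, and the exact liblxc signatures below are assumptions to verify against the installed LXC version.

```c
#include <lxc/lxccontainer.h>   /* liblxc; checkpoint/restore requires CRIU */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* Container name is a placeholder. */
    struct lxc_container *c = lxc_container_new("hpc-ct1", NULL);
    if (!c)
        return 1;

    /* Dump the container state to a directory and stop it. In a
     * migration, this directory would be copied to the target node
     * and passed to restore() there. */
    if (!c->checkpoint(c, "/var/tmp/ct1-ckpt", true, false)) {
        fprintf(stderr, "checkpoint failed\n");
        lxc_container_put(c);
        return 1;
    }

    /* On the target node (same kernel/OS requirement applies): */
    if (!c->restore(c, "/var/tmp/ct1-ckpt", false))
        fprintf(stderr, "restore failed\n");

    lxc_container_put(c);
    return 0;
}
```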


Page 22: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Evaluation of Migration Techniques

Page 23: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Test Environment

4-node cluster:
- 2 Sandy Bridge systems
- 2 Ivy Bridge systems

Mellanox InfiniBand FDR fabric:
- Up to 56 Gbit/s
- Support for SR-IOV

Open MPI 1.7 with BLCR support (except for the LXC results)


Page 24: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Throughput

[Figure: ib_write_bw throughput (Ivy Bridge); throughput in MiB/s vs. message size in bytes (128 B to 4 MiB) for Native, Pass-Through, and SR-IOV]


Page 25: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Throughput

[Figure: Open MPI throughput (GCC 4.4.7, Ivy Bridge); throughput in MiB/s vs. message size in bytes (128 B to 32 MiB) for Native, Pass-Through, SR-IOV, and BLCR]


Page 26: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Latency

[Figure: latency in µs (RTT/2) for InfiniBand and Open MPI; bars for Native, Pass-Through, SR-IOV, and BLCR]


Page 27: Revisiting Co-Scheduling for Upcoming ExaScale Systems

NAS Parallel Benchmarks (Open MPI)

[Figure: NAS Parallel Benchmarks FT, BT, and LU; speed in GFlops/s for Native, BLCR, and KVM]


Page 28: Revisiting Co-Scheduling for Upcoming ExaScale Systems

NAS Parallel Benchmarks (Parastation)

[Figure: NAS Parallel Benchmarks FT, BT, and LU; speed in GFlops/s for Native, LXC, and KVM]


Page 29: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Migration Time

[Figure: overall migration time and downtime in seconds for BLCR and KVM]


Page 30: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Co-Scheduling

Page 31: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Project partner – Technische Universität München (TUM)
Detailed analysis of reference applications:
- LAMA (shared memory, memory bound) and BLAST (message passing, compute bound) are analyzed with existing tools and with tools developed within the project
- LAMA and BLAST are accelerated by up to a factor of two

In cooperation with JGU and RWTH, integration into the agent system
Further development of autopin+


Page 32: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Scalability without Co-Scheduling

[Figure: run time in seconds vs. number of threads/processes (1–31) without co-scheduling; curves for the CG solver and MPIBlast]


Page 33: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Energy / Power Demand without Co-Scheduling

[Figure: MPIBlast energy and power demand without co-scheduling vs. number of MPIBlast processes; energy in joules (Remainder, Cores, Uncore, RAM) and power in watts (RAM, Cores, Package, Clustsafe)]


Page 34: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Energy / Power Demand without Co-Scheduling

[Figure: LAMA energy and power demand without co-scheduling vs. number of CG solver threads; energy in joules (Remainder, Cores, Uncore, RAM) and power in watts (RAM, Cores, Package, Clustsafe)]


Page 35: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Run time / Efficiency with Co-Scheduling

[Figure: run time in seconds and efficiency with co-scheduling vs. number of MPIBlast processes; curves for the CG solver and MPIBlast]


Page 36: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion

Page 37: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Co-Scheduling

Co-scheduling is quite effective in the case examined
Improvements over sequential execution:
- Up to 12 % improvement in energy consumption (in joules)
- Up to 28 % improvement in run time

Next steps: classifying application characteristics to find suitable candidates for co-scheduling
Survey started at German compute centers (LRZ, RZG, RRZE, RWTH, ...)


Page 38: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Virtualization

Inconclusive microbenchmark analysis
- Overhead is highly application dependent

Migration:
- Significant overhead of KVM concerning the overall migration time
  - Qualified by the downtime test
- KVM already supports live migration
- BLCR requires all processes to be stopped during migration

Flexibility:
- Process-level migration generates residual dependencies
  → A non-transparent approach would be required
- VM/container-based migration reduces these dependencies


Page 39: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Virtualization

First, we will focus on VM migration within FAST
Containers may provide even better performance:
- Not as flexible as VMs (e.g., same OS required)
- More isolation than process-level migration

Investigation of the application dependency observed during our studies
Enable VM migration with attached pass-through devices:
- IB migration nearly finalized

→ More about FAST on http://www.en.fast-project.de


Page 40: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Acknowledgements

The results¹ were obtained in collaboration with the following persons:
- Carsten Trinitis (TUM)
- Josef Weidendorfer (TUM)
- Jens Breitbart (TUM)
- André Brinkmann (JGU)
- Ramy Gad (JGU)
- Lars Nagel (JGU)
- Tim Süß (JGU)
- Simon Pickartz (RWTH)
- Carsten Clauss (ParTec)
- Stephan Krempel (ParTec)
- Robert Hommel (Megware)

¹ Jens Breitbart, Carsten Trinitis, and Josef Weidendorfer. "Case Study on Co-Scheduling for HPC Applications". In: Proceedings of the International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2015). Beijing, China, 2015; Simon Pickartz et al. "Migration Techniques in HPC Environments". In: Euro-Par 2014: Parallel Processing Workshops. Vol. 8806. Lecture Notes in Computer Science. Springer International Publishing, 2014, pp. 486–497. doi: 10.1007/978-3-319-14313-2_41.


Page 41: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Power Demand with Co-Scheduling

[Figure: power demand in watts with co-scheduling vs. number of MPIBlast processes; curves for Package, Cores, RAM, and Clustsafe]


Page 42: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Power Demand with Co-Scheduling

[Figure: seconds and efficiency with co-scheduling vs. number of MPIBlast processes (3–15); curves for the CG solver and MPIBlast]


Page 43: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Thank you for your kind attention!

Stefan Lankes – [email protected]

Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10
52074 Aachen
Germany

www.acs.eonerc.rwth-aachen.de