Revisiting Co-Scheduling for Upcoming ExaScale Systems
Plenary talk at HPCS 2015, Stefan Lankes


Page 1: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Revisiting Co-Scheduling for Upcoming ExaScale Systems
Plenary talk at HPCS 2015

Stefan Lankes

Page 2: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Agenda

Motivation

Migration in HPC Environments
- Process-level
- Virtualization
- Container-based

Evaluation of Migration Techniques
- Overhead
- Migration Time

Co-Scheduling

Conclusion


Page 3: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Motivation

Page 4: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications


Page 5: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Funded by the Federal Ministry of Education and Research, Germany, 2014–2016
Project partners:

- Johannes Gutenberg-Universität Mainz, Zentrum für Datenverarbeitung (JGU), coordinator
- Technische Universität München, Lehrstuhl für Rechnertechnik und Rechnerorganisation (TUM)
- Universität zu Köln, Regionales Rechenzentrum (RRZK)
- Fraunhofer-Institut für Algorithmen und wissenschaftliches Rechnen (SCAI)
- RWTH Aachen University, Institute for Automation of Complex Power Systems (ACS), E.ON Energy Research Center
- MEGWARE Computer Vertrieb und Service GmbH
- ParTec Cluster Competence Center GmbH


Page 6: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Exascale Systems

Characteristics/Assumptions:
- Increasing number of cores per node
- Increase in CPU performance will not be matched by I/O performance

→ Most applications cannot exploit this parallelism

Consequences:
- Exclusive node assignment has to be revoked (Co-Scheduling)
- Dynamic scheduling during runtime will be necessary

→ Migration of processes between nodes indispensable


Page 7: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

A twofold scheduling approach:

1. Initial placement of jobs by a global scheduler according to KPIs:
   - Load of the individual nodes
   - Power consumption
   - Applications' resource requirements

2. Dynamic runtime adjustments:
   - Migration of processes
   - Tight coupling with the applications
   - Feedback to the global scheduler

[Figure: a central scheduler coordinates per-node agents; each node's agent manages local processes of App 1 and App 2]
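The slides do not show code for the agent/scheduler interaction; purely as an illustrative sketch, a per-node agent could periodically report KPIs such as node load, power draw, and the applications' resource demands to the central scheduler. All names and the report layout below are hypothetical assumptions, not part of the FAST project code.

```c
#include <stdio.h>

/* Hypothetical KPI report a per-node agent might send to the
 * central scheduler (illustrative only, not FAST project code). */
struct kpi_report {
    int    node_id;
    double load;            /* e.g., run-queue length or CPU utilization */
    double power_watts;     /* current node power draw */
    double mem_bw_demand;   /* aggregated memory-bandwidth demand of local apps */
};

/* Placeholder for the transport to the central scheduler (e.g., a socket). */
static void send_to_scheduler(const struct kpi_report *r)
{
    printf("node %d: load=%.2f power=%.1f W mem-bw=%.2f\n",
           r->node_id, r->load, r->power_watts, r->mem_bw_demand);
}

int main(void)
{
    /* In a real agent this would be sampled periodically and fed back,
     * so the scheduler can trigger process migrations at runtime. */
    struct kpi_report r = { .node_id = 1, .load = 0.75,
                            .power_watts = 180.0, .mem_bw_demand = 0.6 };
    send_to_scheduler(&r);
    return 0;
}
```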


Page 8: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Project partner – RWTH Aachen University

Main task: process migration for load balancing
Close interaction between:
- Migration
- Scheduling
- Application

Applications know their performance characteristics and resource requirements


Page 9: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Migration in HPC Environments

Page 10: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Process-level Migration

[Figure: process-level migration; processes run on the OS of Host A and Host B, connected via an IB fabric, and a single process is moved from Host A to Host B]


Page 11: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Process-level Migration

Move a process from one node to another, including its execution context (i.e., register state and physical memory)

A special kind of Checkpoint/Restart (C/R) operation
Several frameworks available:
- Condor's checkpoint library
- libckpt
- Berkeley Lab Checkpoint/Restart (BLCR)
- Distributed MultiThreaded Checkpointing (DMTCP)


Page 12: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Berkeley Lab Checkpoint/Restart

Specifically designed for HPC applications
Two components:
- Kernel module performing the C/R operation
- Shared library enabling user-space access

Cooperation with the C/R procedure via a callback interface:
- Close/open file descriptors
- Tear down/reestablish communication channels
- ...

→ Residual dependencies that have to be resolved!
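As an illustration of this callback-based cooperation, the sketch below follows BLCR's libcr interface (cr_init, cr_register_callback, cr_checkpoint); treat the exact signatures and flag values as assumptions to be checked against the BLCR headers.

```c
#include <libcr.h>   /* BLCR's user-space library (assumed installed) */
#include <stdio.h>

/* Callback invoked around a checkpoint: this is where residual
 * dependencies (open files, communication channels) are torn down
 * before the dump and reestablished afterwards. */
static int checkpoint_cb(void *arg)
{
    /* ... close file descriptors, drain/tear down channels ... */

    int rc = cr_checkpoint(0);          /* hand control to BLCR */

    if (rc > 0) {
        /* restarted (possibly on another node): reopen files, reconnect */
    } else if (rc == 0) {
        /* continuing in the original process after the checkpoint */
    } else {
        fprintf(stderr, "checkpoint failed\n");
    }
    return 0;
}

int main(void)
{
    cr_init();                                          /* initialize libcr */
    cr_register_callback(checkpoint_cb, NULL, CR_THREAD_CONTEXT);

    /* ... normal application work; a checkpoint can now be requested
     * externally, e.g. with BLCR's cr_checkpoint utility ... */
    return 0;
}
```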


Page 13: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Virtual Machine Migration

[Figure: VM migration; Host A and Host B each run VMs (each hosting a process) on top of a hypervisor, connected via an IB fabric, and a whole VM is moved from Host A to Host B]


Page 14: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Virtual Machine Migration

Migration of a virtualized execution environment
Reduction of the residual dependencies:
- File descriptors still valid after the migration
- Communication channels are automatically restored (e.g., TCP)

Performance degradation of I/O devices can be compensated by:
- Pass-through (e.g., Intel VT-d)
- Single Root I/O Virtualization (SR-IOV) (see the host-side sketch after this list)

Various hypervisors available:
- Xen
- Kernel-based Virtual Machine (KVM)
- ...
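As a rough illustration of how VFs for SR-IOV are provisioned on the host, the sketch below writes the desired VF count to the device's sriov_numvfs attribute in sysfs; the PCI address used here is a hypothetical example, and whether this interface (rather than driver module parameters) applies depends on the NIC/HCA driver.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical PCI address of the physical function (PF);
     * replace with the real device present on the host. */
    const char *path =
        "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("open sriov_numvfs");
        return 1;
    }

    /* Request four virtual functions (VFs); guests can then be given
     * one VF each via PCI pass-through. */
    fprintf(f, "4\n");
    fclose(f);
    return 0;
}
```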


Page 15: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Kernel Based Virtual Machine

A Linux kernel module that benefits from existing Linux resources:
- Scheduler
- Memory management
- ...

Implements full virtualization requiring hardware support
VMs are scheduled by the host like any other process
Already provides a migration framework (see the sketch below):
- Live migration
- Cold migration
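The slides do not prescribe a particular management interface; as one possible sketch, a KVM guest can be live-migrated via the libvirt C API as shown below. The host names and the domain name are placeholders, and error handling is kept minimal.

```c
#include <libvirt/libvirt.h>   /* link with -lvirt */
#include <stdio.h>

int main(void)
{
    /* Connections to the source and destination hypervisors
     * (host names are placeholders). */
    virConnectPtr src = virConnectOpen("qemu:///system");
    virConnectPtr dst = virConnectOpen("qemu+ssh://hostB/system");
    if (!src || !dst) {
        fprintf(stderr, "failed to connect to hypervisor(s)\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(src, "hpc-vm1");
    if (!dom) {
        fprintf(stderr, "domain not found\n");
        return 1;
    }

    /* Live migration: the guest keeps running while its memory is
     * transferred; only a short downtime occurs at the end. */
    virDomainPtr migrated =
        virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0);
    if (!migrated)
        fprintf(stderr, "migration failed\n");

    virDomainFree(dom);
    if (migrated)
        virDomainFree(migrated);
    virConnectClose(src);
    virConnectClose(dst);
    return 0;
}
```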


Page 16: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Software-based Sharing

[Figure: two VMs, each with an eth0 interface, attached to a software virtual switch in the hypervisor that multiplexes the physical NIC]


Page 17: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Pass-Through

[Figure: the VMs' eth0 interfaces access the physical NIC directly via pass-through, bypassing the hypervisor's software switch]


Page 18: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Single Root I/O Virtualization

[Figure: the physical NIC exposes a Physical Function (PF) and multiple Virtual Functions (VFs); each VM's eth0 is bound directly to its own VF, bypassing the hypervisor]


Page 19: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Container-based Virtualization

[Figure: (a) Virtual machines: hardware, host kernel, hypervisor, one guest kernel per VM, apps on top; (b) Containers: hardware, shared kernel, virtualization layer, apps on top]



Page 21: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Container-based Migration

Full virtualization results in multiple kernels on one node
Idea of container-based virtualization:
→ Reuse the existing host kernel to manage multiple user-space instances

Host and guest have to use the same operating system
Common representatives:
- OpenVZ
- Linux Containers (LXC) (see the checkpoint/restore sketch below)
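As a hedged sketch of the container-migration building blocks, LXC's C API exposes checkpoint/restore hooks (backed by CRIU); a migration would checkpoint on the source node, transfer the image directory, and restore on the target. The container name, the directory, and the exact liblxc signatures below are assumptions to verify against the installed LXC version.

```c
#include <lxc/lxccontainer.h>   /* liblxc; checkpoint/restore requires CRIU */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* Container name is a placeholder. */
    struct lxc_container *c = lxc_container_new("hpc-ct1", NULL);
    if (!c)
        return 1;

    /* Dump the container state to a directory and stop it. In a
     * migration, this directory would be copied to the target node
     * and passed to restore() there. */
    if (!c->checkpoint(c, "/var/tmp/ct1-ckpt", true, false)) {
        fprintf(stderr, "checkpoint failed\n");
        lxc_container_put(c);
        return 1;
    }

    /* On the target node (same kernel/OS requirement applies): */
    if (!c->restore(c, "/var/tmp/ct1-ckpt", false))
        fprintf(stderr, "restore failed\n");

    lxc_container_put(c);
    return 0;
}
```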


Page 22: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Evaluation of Migration Techniques

Page 23: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Test Environment

4-node cluster:
- 2 Sandy Bridge systems
- 2 Ivy Bridge systems

Mellanox InfiniBand FDR fabric:
- Up to 56 Gbit/s
- Support for SR-IOV

Open MPI 1.7 with BLCR support (except for the LXC results)


Page 24: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Throughput

[Figure: ib_write_bw throughput (Ivy Bridge); throughput in MiB/s vs. message size in bytes (128 B to 4 MiB) for Native, Pass-Through, and SR-IOV]


Page 25: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Throughput

[Figure: Open MPI throughput (GCC 4.4.7, Ivy Bridge); throughput in MiB/s vs. message size in bytes (128 B to 32 MiB) for Native, Pass-Through, SR-IOV, and BLCR]


Page 26: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Latency

[Figure: latency in µs (RTT/2) for InfiniBand and Open MPI; bars for Native, Pass-Through, SR-IOV, and BLCR]


Page 27: Revisiting Co-Scheduling for Upcoming ExaScale Systems

NAS Parallel Benchmarks (Open MPI)

[Figure: NAS Parallel Benchmarks FT, BT, and LU; speed in GFlops/s for Native, BLCR, and KVM]


Page 28: Revisiting Co-Scheduling for Upcoming ExaScale Systems

NAS Parallel Benchmarks (Parastation)

[Figure: NAS Parallel Benchmarks FT, BT, and LU; speed in GFlops/s for Native, LXC, and KVM]


Page 29: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Migration Time

[Figure: overall migration time and downtime in seconds for BLCR and KVM]


Page 30: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Co-Scheduling

Page 31: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Find a Suitable Topology for Exascale Applications

Project partner – Technische Universität München (TUM)
Detailed analysis of reference applications:
- LAMA (shared memory, memory bound) and BLAST (message passing, compute bound) are analyzed with existing tools and with tools developed within the project
- LAMA and BLAST are accelerated by up to a factor of two

In cooperation with JGU and RWTH, integration into the agent system
Further development of autopin+


Page 32: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Scalability without Co-Scheduling

[Figure: run time in seconds vs. number of threads/processes (1–31) without co-scheduling; curves for the CG solver and MPIBlast]


Page 33: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Energy / Power Demand without Co-Scheduling

[Figure: MPIBlast energy and power demand without co-scheduling vs. number of MPIBlast processes; energy in joules (Remainder, Cores, Uncore, RAM) and power in watts (RAM, Cores, Package, Clustsafe)]


Page 34: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Energy / Power Demand without Co-Scheduling

[Figure: LAMA energy and power demand without co-scheduling vs. number of CG solver threads; energy in joules (Remainder, Cores, Uncore, RAM) and power in watts (RAM, Cores, Package, Clustsafe)]


Page 35: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Run time / Efficiency with Co-Scheduling

[Figure: run time in seconds and efficiency with co-scheduling vs. number of MPIBlast processes; curves for the CG solver and MPIBlast]


Page 36: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion

Page 37: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Co-Scheduling

Co-scheduling is quite effective in the case examined
Improvements over sequential execution:
- Up to 12 % improvement in energy consumption (in joules)
- Up to 28 % improvement in run time

Next steps: classifying application characteristics to find suitable candidates for co-scheduling
Survey started at German compute centers (LRZ, RZG, RRZE, RWTH, ...)


Page 38: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Virtualization

Inconclusive microbenchmark analysis
- Overhead is highly application dependent

Migration:
- Significant overhead of KVM concerning the overall migration time
  - Qualified by the downtime test
- KVM already supports live migration
- BLCR requires all processes to be stopped during migration

Flexibility:
- Process-level migration generates residual dependencies
  → A non-transparent approach would be required
- VM/container-based migration reduces these dependencies


Page 39: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Conclusion and Outlook – Virtualization

First, we will focus on VM migration within FAST
Containers may provide even better performance:
- Not as flexible as VMs (e.g., same OS required)
- More isolation than process-level migration

Investigation of the application dependency observed during our studies
Enable VM migration with attached pass-through devices:
- IB migration nearly finalized

→ More about FAST on http://www.en.fast-project.de


Page 40: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Acknowledgements

The results¹ were obtained in collaboration with the following persons:
- Carsten Trinitis (TUM)
- Josef Weidendorfer (TUM)
- Jens Breitbart (TUM)
- André Brinkmann (JGU)
- Ramy Gad (JGU)
- Lars Nagel (JGU)
- Tim Süß (JGU)
- Simon Pickartz (RWTH)
- Carsten Clauss (ParTec)
- Stephan Krempel (ParTec)
- Robert Hommel (Megware)

¹ Jens Breitbart, Carsten Trinitis, and Josef Weidendorfer. "Case Study on Co-Scheduling for HPC Applications". In: Proceedings of the International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2015). Beijing, China, 2015; Simon Pickartz et al. "Migration Techniques in HPC Environments". In: Euro-Par 2014: Parallel Processing Workshops. Vol. 8806. Lecture Notes in Computer Science. Springer International Publishing, 2014, pp. 486–497. doi: 10.1007/978-3-319-14313-2_41.


Page 41: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Power Demand with Co-Scheduling

[Figure: power demand in watts with co-scheduling vs. number of MPIBlast processes; curves for Package, Cores, RAM, and Clustsafe]


Page 42: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Power Demand with Co-Scheduling

[Figure: seconds and efficiency with co-scheduling vs. number of MPIBlast processes (3–15); curves for the CG solver and MPIBlast]


Page 43: Revisiting Co-Scheduling for Upcoming ExaScale Systems

Thank you for your kind attention!

Stefan Lankes – [email protected]

Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10
52074 Aachen
Germany

www.acs.eonerc.rwth-aachen.de