Revisiting Co-Scheduling for Upcoming ExaScale Systems
Plenary talk at HPCS 2015
Stefan Lankes
Agenda
- Motivation
- Migration in HPC Environments: process-level, virtualization, container-based
- Evaluation of Migration Techniques: overhead, migration time
- Co-Scheduling
- Conclusion
2 of 42 Revisiting Co-Scheduling | Stefan Lankes | RWTH Aachen University | 21/07/2015
Motivation
Find a Suitable Topology for Exascale Applications
Find a Suitable Topology for Exascale Applications
Funded by the Federal Ministry of Education and Research, Germany, 2014–2016

Project partners:
- Johannes Gutenberg-Universität Mainz, Zentrum für Datenverarbeitung (JGU), coordinator
- Technische Universität München, Lehrstuhl für Rechnertechnik und Rechnerorganisation (TUM)
- Universität zu Köln, Regionales Rechenzentrum (RRZK)
- Fraunhofer-Institut für Algorithmen und wissenschaftliches Rechnen (SCAI)
- RWTH Aachen University, Institute for Automation of Complex Power Systems (ACS), E.ON Energy Research Center
- MEGWARE Computer Vertrieb und Service GmbH
- ParTec Cluster Competence Center GmbH
Exascale Systems
Characteristics/assumptions:
- Increasing number of cores per node
- The increase in CPU performance will not be matched by I/O performance
→ Most applications cannot exploit this parallelism

Consequences:
- Exclusive node assignment has to be revoked (co-scheduling)
- Dynamic scheduling during runtime will be necessary
→ Migration of processes between nodes is indispensable
Find a Suitable Topology for Exascale Applications
A twofold scheduling approach:
1. Initial placement of jobs by a global scheduler according to KPIs:
   - Load of the individual nodes
   - Power consumption
   - Applications' resource requirements
2. Dynamic runtime adjustments:
   - Migration of processes
   - Tight coupling with the applications
   - Feedback to the global scheduler
[Figure: a central scheduler coordinates per-node agents; each node runs processes of App 1 and App 2]
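The initial-placement step can be pictured as a greedy choice over per-node KPIs. A minimal sketch, assuming hypothetical node data, weights, and scoring (not the project's actual scheduler):

```python
def place_job(nodes, cpu_req, w_load=1.0, w_power=0.5):
    """Greedy initial placement: pick the feasible node with the best KPI score.

    nodes: dict name -> {"load": 0..1, "power_w": watts, "free_cores": int}
    Lower score is better; the weights trade off load against power draw.
    """
    feasible = {n: k for n, k in nodes.items() if k["free_cores"] >= cpu_req}
    if not feasible:
        return None  # defer the job: no node satisfies the resource requirement
    return min(feasible,
               key=lambda n: w_load * feasible[n]["load"]
                           + w_power * feasible[n]["power_w"] / 1000.0)

nodes = {
    "node0": {"load": 0.9, "power_w": 350, "free_cores": 8},
    "node1": {"load": 0.2, "power_w": 280, "free_cores": 8},
    "node2": {"load": 0.1, "power_w": 300, "free_cores": 2},
}
assert place_job(nodes, cpu_req=4) == "node1"  # lightly loaded, enough cores
```

The dynamic-adjustment step would then revisit such decisions at runtime, triggering migrations when node scores drift.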
Find a Suitable Topology for Exascale Applications
Project partner – RWTH Aachen University
Main task: process migration for load balancing
- Close interaction between migration, scheduling, and the application
- Applications know their performance characteristics and resource requirements
Migration in HPC Environments
Process-level Migration
[Figure: process-level migration — a process moves from the OS on Host A to the OS on Host B across the IB fabric]
Process-level Migration
Move a process from one node to another, including its execution context (i.e., register state and physical memory)

A special kind of Checkpoint/Restart (C/R) operation; several frameworks available:
- Condor's checkpoint library
- libckpt
- Berkeley Lab Checkpoint/Restart (BLCR)
- Distributed MultiThreaded Checkpointing (DMTCP)
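Conceptually, these frameworks snapshot a process's state and reconstruct it elsewhere. A toy application-level sketch of the idea — real frameworks such as BLCR capture registers and memory pages transparently; here pickle merely stands in for the snapshot:

```python
import os
import pickle
import tempfile

def checkpoint(state, path):
    """Serialize application state to a file (stands in for a C/R snapshot)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restart(path):
    """Reconstruct the state, e.g. on the destination node after migration."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Simulated solver state: iteration counter and partial result
state = {"iteration": 1024, "residual": 3.2e-6}
path = os.path.join(tempfile.gettempdir(), "ckpt.bin")
checkpoint(state, path)
resumed = restart(path)
assert resumed == state  # execution can continue where it left off
```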
Berkeley Lab Checkpoint/Restart
Specifically designed for HPC applications

Two components:
- Kernel module performing the C/R operation
- Shared library enabling user-space access

Cooperation with the C/R procedure via a callback interface:
- Close/open file descriptors
- Tear down/reestablish communication channels
- ...
→ Residual dependencies that have to be resolved!
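The callback interface lets an application quiesce its residual state before a checkpoint and restore it after a restart. A hypothetical, framework-agnostic sketch of that pattern — none of these names are BLCR's actual API:

```python
class CheckpointHooks:
    """Hypothetical C/R hook registry: run teardown callbacks before a
    checkpoint and the matching re-setup callbacks after a restart."""

    def __init__(self):
        self.pre, self.post = [], []

    def register(self, pre_checkpoint, post_restart):
        self.pre.append(pre_checkpoint)
        self.post.append(post_restart)

    def checkpoint(self):
        for f in self.pre:
            f()   # e.g., close file descriptors, drain communication channels

    def restart(self):
        for f in self.post:
            f()   # e.g., reopen files, reestablish connections

log = []
hooks = CheckpointHooks()
hooks.register(lambda: log.append("close fd"), lambda: log.append("reopen fd"))
hooks.checkpoint()
hooks.restart()
assert log == ["close fd", "reopen fd"]
```

Whatever the hooks cannot restore (host-specific paths, peer addresses) remains a residual dependency — the problem the slide points at.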
Virtual Machine Migration
[Figure: VM migration — a process inside a VM moves with its VM from the hypervisor on Host A to the hypervisor on Host B across the IB fabric]
Virtual Machine Migration
Migration of a virtualized execution environment reduces residual dependencies:
- File descriptors are still valid after the migration
- Communication channels are automatically restored (e.g., TCP)

Performance degradation of I/O devices can be compensated by:
- Pass-through (e.g., Intel VT-d)
- Single Root I/O Virtualization (SR-IOV)

Various hypervisors available: Xen, Kernel-based Virtual Machine (KVM), ...
Kernel-based Virtual Machine
A Linux kernel module that benefits from existing resources: scheduler, memory management, ...

Implements full virtualization requiring hardware support; VMs are scheduled by the host like any other process

Already provides a migration framework:
- Live migration
- Cold migration
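KVM's live migration follows a pre-copy scheme: memory is copied while the VM keeps running, pages dirtied in the meantime are re-sent in rounds, and only the final round stops the VM briefly. A back-of-the-envelope model of why downtime stays small — all parameters are illustrative assumptions, not measurements from this talk:

```python
def precopy_model(mem_mb, bw_mbps, dirty_mbps, max_rounds=30, stop_mb=50):
    """Estimate total time and downtime of pre-copy live migration.

    Each round transfers what the previous round left dirty; migration
    converges when the remaining dirty set is small enough to send during
    a short stop-and-copy phase (here: below stop_mb).
    """
    total_s, remaining = 0.0, float(mem_mb)
    for _ in range(max_rounds):
        t = remaining / bw_mbps          # time to send the current dirty set
        total_s += t
        remaining = dirty_mbps * t       # pages dirtied while sending
        if remaining <= stop_mb:
            break
    downtime_s = remaining / bw_mbps     # final stop-and-copy phase
    return total_s + downtime_s, downtime_s

# 4 GiB VM, ~3 GB/s effective bandwidth, moderate dirty rate (assumed values)
total, down = precopy_model(mem_mb=4096, bw_mbps=3000, dirty_mbps=300)
assert down < 0.1 * total  # downtime is a small fraction of the total time
```

Cold migration, by contrast, stops the VM for the entire transfer, so its downtime equals the total migration time.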
Software-based Sharing
[Figure: software-based sharing — each VM's eth0 connects through a software virtual switch in the hypervisor to the physical NIC]
Pass-Through
[Figure: pass-through — each VM's eth0 is mapped directly to the physical NIC, bypassing the hypervisor]
Single Root I/O Virtualization
[Figure: SR-IOV — the physical NIC exposes a physical function (PF) and virtual functions (VFs); each VM's eth0 attaches to its own VF, bypassing the hypervisor]
Container-based Virtualization
[Figure: (a) virtual machines — apps run on guest kernels atop a hypervisor, host kernel, and hardware; (b) containers — apps share the host kernel through a thin virtualization layer]
Container-based Migration
Full virtualization results in multiple kernels on one node

Idea of container-based virtualization:
→ Reuse the existing host kernel for the management of multiple user-space instances
- Host and guest have to use the same operating system

Common representatives:
- OpenVZ
- LinuX Containers (LXC)
Evaluation of Migration Techniques
Test Environment
4-node cluster:
- 2 Sandy Bridge systems
- 2 Ivy Bridge systems

Mellanox InfiniBand FDR fabric:
- Up to 56 Gbit/s
- Support for SR-IOV

Open MPI 1.7 with BLCR support (except for the LXC results)
Throughput
[Figure: throughput in MiB/s vs. message size (128 B to 4 MiB), ib_write_bw on Ivy Bridge; curves for Native, Pass-Through, and SR-IOV]
Throughput
[Figure: throughput in MiB/s vs. message size (128 B to 32 MiB), Open MPI (GCC 4.4.7, Ivy Bridge); curves for Native, Pass-Through, SR-IOV, and BLCR]
Latency
[Figure: latency in µs (RTT/2) for InfiniBand and Open MPI; bars for Native, Pass-Through, SR-IOV, and BLCR]
NAS Parallel Benchmarks (Open MPI)
[Figure: NAS Parallel Benchmarks FT, BT, and LU, speed in GFlop/s; bars for Native, BLCR, and KVM]
NAS Parallel Benchmarks (ParaStation)
[Figure: NAS Parallel Benchmarks FT, BT, and LU, speed in GFlop/s; bars for Native, LXC, and KVM]
Migration Time
[Figure: migration time in seconds (up to ~2.5 s) for BLCR and KVM; bars for overall migration time and downtime]
Co-Scheduling
Find a Suitable Topology for Exascale Applications
Project partner – Technische Universität München (TUM)

Detailed analysis of reference applications:
- LAMA (shared memory, memory bound) and BLAST (message passing, compute bound) are analyzed with existing tools and with tools developed within the project
- LAMA and BLAST are accelerated by up to a factor of two

In cooperation with JGU and RWTH, integration into the agent system; further development of autopin+
Scalability without Co-Scheduling
[Figure: run time in seconds vs. number of threads/processes (1–31) for the CG solver and MPIBlast without co-scheduling]
Energy / Power Demand without Co-Scheduling
MPIBlast
[Figure: energy (J) and power (W) vs. number of MPIBlast processes (3–31); series: Remainder, Cores, Uncore, and RAM in joules; RAM, Cores, Package, and Clustsafe in watts]
Energy / Power Demand without Co-Scheduling
LAMA
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 310
50
100
150
200
250
300
350
400
450
0
5000
10000
15000
20000
25000
Remainder (J)
Cores (J)
Uncore (J)
RAM (J)
RAM (W)
Cores (W)
Package (W)
Clustsafe (W)
#CG solver threads
Watt Joule
Run time / Efficiency with Co-Scheduling
[Figure: run time in seconds and efficiency (0–1.2) vs. number of MPIBlast processes (3–30) for the CG solver and MPIBlast under co-scheduling]
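The efficiency axis in such co-scheduling plots is commonly derived by normalizing each application's co-scheduled throughput to its solo throughput and summing: a combined value above 1 means running the pair together beats running them one after the other. A small sketch with illustrative (non-measured) run times:

```python
def coscheduling_efficiency(solo_times, co_times):
    """Sum of per-application throughput ratios (solo time / co-scheduled time).

    A value above 1.0 means co-scheduling both applications on one node
    outperforms running them sequentially on that node.
    """
    return sum(solo / co for solo, co in zip(solo_times, co_times))

# Hypothetical numbers: alone the apps take 100 s and 200 s;
# co-scheduled they slow down to 160 s and 250 s.
eff = coscheduling_efficiency([100, 200], [160, 250])
assert eff > 1.0  # co-scheduling pays off in this example
```

Note this is a generic definition of the metric, not necessarily the exact one used for the plot above.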
Conclusion
Conclusion and Outlook – Co-Scheduling
Co-Scheduling is quite effective in the case examined; improvements over sequential execution:
- Up to 12 % improvement in energy consumption (in joules)
- Up to 28 % improvement in run time

Next steps: classifying application characteristics to find suitable candidates for co-scheduling
- Survey started at German compute centers (LRZ, RZG, RRZE, RWTH, ...)
Conclusion and Outlook – Virtualization
Inconclusive microbenchmark analysis; the overhead is highly application dependent

Migration:
- Significant overhead of KVM concerning overall migration time, qualified by the downtime test
- KVM already supports live migration
- BLCR requires all processes to be stopped during migration

Flexibility:
- Process-level migration generates residual dependencies
  → A non-transparent approach would be required
- VM/container-based migration reduces these dependencies
Conclusion and Outlook – Virtualization
First, we will focus on VM migration within FAST

Containers may provide even better performance:
- Not as flexible as VMs (e.g., same OS required)
- More isolation than process-level migration

Investigation of the application dependency observed during our studies

Enable VM migration with attached pass-through devices:
- IB migration nearly finalized

→ More about FAST at http://www.en.fast-project.de
Acknowledgements
The results¹ were obtained in collaboration with the following persons:
- Carsten Trinitis (TUM)
- Josef Weidendorfer (TUM)
- Jens Breitbart (TUM)
- André Brinkmann (JGU)
- Ramy Gad (JGU)
- Lars Nagel (JGU)
- Tim Süß (JGU)
- Simon Pickartz (RWTH)
- Carsten Clauss (ParTec)
- Stephan Krempel (ParTec)
- Robert Hommel (Megware)

¹ Jens Breitbart, Carsten Trinitis, and Josef Weidendorfer. "Case Study on Co-Scheduling for HPC Applications". In: Proceedings of the International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2015). Beijing, China, 2015.
Simon Pickartz et al. "Migration Techniques in HPC Environments". In: Euro-Par 2014: Parallel Processing Workshops. Vol. 8806. Lecture Notes in Computer Science. Springer International Publishing, 2014, pp. 486–497. doi: 10.1007/978-3-319-14313-2_41.
Power Demand with Co-Scheduling
[Figure: power demand in watts (0–500) vs. number of MPIBlast processes (3–30) under co-scheduling; series: Package, Cores, RAM, and Clustsafe]
Power Demand with Co-Scheduling
[Figure: run time in seconds and efficiency (0–1) vs. number of MPIBlast processes (3–15) for the CG solver and MPIBlast]
Thank you for your kind attention!
Stefan Lankes – [email protected]
Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10
52074 Aachen
Germany
www.acs.eonerc.rwth-aachen.de