
Page 1

ORNL is managed by UT-Battelle for the US Department of Energy

Oak Ridge National Laboratory, Computing and Computational Sciences Directorate

Thomas Naughton, Lawrence Sorrillo, Adam Simpson and Neena Imam

Oak Ridge National Laboratory

Balancing Performance and Portability with Containers in HPC: An OpenSHMEM Example

August 8, 2017

OpenSHMEM 2017 Workshop, Annapolis, MD, USA

Page 2

Introduction

• Growing interest in methods to support user-driven software customizations in an HPC environment
  – Leverage isolation mechanisms & container tools

• Reasons
  – Improve productivity of users
  – Improve reproducibility of computational experiments

Page 3

Containers

• What's a Container?
  – "Knobs" to tailor the classic UNIX fork/exec model
  – Method for encapsulating an execution environment
  – Leverages operating system (OS) isolation mechanisms
    • Resource namespaces & control groups (cgroups)

Example: a process gets an "isolated" view of running processes (PIDs), and a control group that restricts it to 50% of a CPU (see the sketch below)

• Container Runtime
  – Coordinates OS mechanisms
  – Provides streamlined (simplified) access for the user
  – Initialization of the "container" & interfaces to attach/interact
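To make the example above concrete, here is a minimal sketch (not from the slides) combining a PID namespace with a cgroup CPU quota, using the util-linux unshare tool and the cgroup-v1 cpu controller; the cgroup name "demo" and the quota values are illustrative and root privileges are assumed.

# Sketch: give a process an "isolated" PID view and cap it at 50% of one CPU.
# Assumes cgroup v1 is mounted at /sys/fs/cgroup/cpu.
mkdir -p /sys/fs/cgroup/cpu/demo
echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us   # 100 ms period
echo  50000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us    # 50 ms quota = 50% CPU
echo $$ > /sys/fs/cgroup/cpu/demo/tasks                   # place this shell in the cgroup

unshare --pid --mount --fork sh -c '
    mount -t proc proc /proc   # fresh /proc for the new PID namespace
    ps -ef                     # shows only "our" processes; the shell is PID 1
'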

Page 4

Container Motivations

• Compute mobility
  – Build the application with its software stack and carry it to the compute resource
  – Customize the execution environment to end-user preferences

• Reproducibility
  – Run experiments with the same software & data
  – Archive the experimental setup for reuse (by self or others)

• Packaging & Distribution
  – Configure applications with possibly complex software dependencies, using the packages that are "best" for the user
  – Benefits user and developer productivity

Page 5

Challenges

• Integration into the HPC ecosystem
  – Tools & container runtimes for HPC

• Accessing high-performance resources
  – Take advantage of advanced hardware capabilities
  – Typically in the form of device drivers / user-level interfaces
    • Example: HPC NIC and associated communication libraries

• Methodology and best practices
  – Methods for balancing portability & performance
  – Establish "best practices" for building/assembling/using

Page 6

Docker

• Container technology comprised of several parts
  – Engine: user interface, storage drivers, network overlays, container runtime
  – Registry: image server (public & private)
  – Swarm: machine and control (daemon) overlay
  – Compose: orchestration interface/specification

• Tools & specifications
  – Same specification format for all layers (consistency)
  – Rich set of tools for image management/specification
    • e.g., DockerHub, Dockerfile, Compose, etc.

https://www.docker.com
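As a hedged illustration of the Dockerfile-based image specification mentioned above (a sketch, not the talk's actual recipe; the tag, package list, and image name are illustrative):

# Sketch: specify and build an Ubuntu 17.04 image carrying Open MPI's
# OpenSHMEM tooling from the distro packages.
cat > Dockerfile <<'EOF'
FROM ubuntu:17.04
RUN apt-get update && \
    apt-get install -y build-essential openmpi-bin libopenmpi-dev
EOF

docker build -t graph500-oshmem:dev .                 # build the image locally
docker run --rm graph500-oshmem:dev oshrun --version  # quick smoke test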

Page 7

Docker Container Runtime

• runc & containerd
  – Low-level runtime for container creation/management
  – Donated as the reference implementation for the OCI (2015)
    • Open Container Initiative – runtime & image specs
    • Formerly libcontainer in Docker
  – Used with containerd as of Docker 1.12

[Diagram: a Container Image composed of layers (Image-Layer-x, Image-Layer-y, Image-Layer-z) is instantiated as a Container Instance (container rootfs + container.json), with containerd driving runc]
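To make the containerd/runc split concrete, here is a hedged sketch (not from the slides) of driving runc directly on a rootfs exported from Docker; the directory layout and container name "demo" are illustrative.

# Sketch: run a container with runc alone, bypassing the Docker engine.
mkdir -p demo/rootfs
docker export "$(docker create ubuntu:17.04)" | tar -C demo/rootfs -xf -

cd demo
runc spec            # writes a default config.json (the OCI runtime spec)
sudo runc run demo   # creates and starts the container instance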

Page 8

Singularity Container Runtime

• Container runtime for "compute mobility"
  – Code developed "from scratch" by Greg Kurtzer
  – Started in early 2016 (http://singularity.lbl.gov)

• Created with HPC site/use awareness
  – Expected practices for HPC resources (e.g., network, launch)
  – Expected behavior for user permissions

• Adding features to gain parity with Docker
  – Importing Docker images
  – Creating a Singularity-Hub
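A hedged sketch of the Singularity 2.x workflow for the Docker-import feature noted above (the image size and file names are illustrative):

# Sketch: create a Singularity 2.x image file and import a Docker image.
sudo singularity create --size 2048 ubuntu.img          # empty image (MiB)
sudo singularity import ubuntu.img docker://ubuntu:17.04

singularity exec ubuntu.img cat /etc/os-release         # run as a normal user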

Page 9

Docker
• Software stack
  – Several components
    • dockerd, swarm, registry, etc.
  – Rich set of tools & capabilities
• Networking
  – Supports full network isolation
  – Supports pass-through "host" mode
• Storage
  – Full storage abstraction
    • Example: devicemapper, AUFS

Singularity
• Software stack
  – Small code base
  – Follows more traditional HPC patterns
    • User permissions
    • Existing HPC launchers
• Networking
  – No network isolation
• Storage
  – No storage abstraction

Page 10

Docker
• Images
  – Advanced read/write capabilities (layers)
    • Copy-on-Write (COW) for the "container layer"
• Image creation
  – Specification file (Dockerfile)
• Image sharing
  – DockerHub is the central location for sharing images

Singularity
• Images
  – Single image "file" (or 'rootfs' dir)
    • No COW or container layers
• Image creation
  – Specification file
  – Supports Docker image import
• Image sharing
  – SingularityHub emerging as a basis to build/share images

Page 11

Example: MPI Application

• MPI – Docker
  – Run an SSH daemon in the containers
  – Span nodes using Docker networking ("virtual networking")
  – Fully abstracted from the host
  – MPI entirely within the container

• MPI – Singularity
  – Run the SSH daemon on the host
  – Span nodes using the host network (no network isolation)
  – Support the network at the host/MPI-RTE layer
  – MPI split between host and container (both styles are sketched below)
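A hedged sketch contrasting the two launch styles (the image names mpi-image/app.img, the program path /benchmarks/app, and the host names are all illustrative; the Docker variant assumes an MPI install plus sshd inside the image):

# Docker style (sketch): containers joined by a Docker "virtual" network,
# sshd and MPI entirely inside the containers.
docker network create -d overlay --attachable mpinet
docker run -d --name mpi0 --network mpinet mpi-image /usr/sbin/sshd -D
docker run -d --name mpi1 --network mpinet mpi-image /usr/sbin/sshd -D
docker exec mpi0 mpirun -np 2 --host mpi0,mpi1 /benchmarks/app

# Singularity style (sketch): the host's MPI launcher spans nodes over the
# host network and starts one containerized process per rank.
mpirun -np 2 --host node0,node1 singularity exec app.img /benchmarks/app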

Page 12

Evaluation

• Objective: evaluate the viability of using the same image on a "developer" system and then on a "production" system
  – Use an existing OpenSHMEM benchmark

• Container image
  – Ubuntu 17.04
    • Recent release & not available on the production system
  – Select Open MPI's implementation of OpenSHMEM
    • Directly available from Ubuntu (a quick check is sketched below)
  – Select Graph500 as the demonstration application
    • Using the OpenSHMEM port
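A hedged sketch of a quick check that the Ubuntu-packaged Open MPI really carries the OpenSHMEM layer (package names are Ubuntu's; the exact output will vary):

# Sketch: verify the distro Open MPI provides OpenSHMEM inside the image.
apt-get update && apt-get install -y openmpi-bin libopenmpi-dev
oshcc --showme            # the OpenSHMEM wrapper compiler should resolve
ompi_info | grep -i spml  # should list SPML components such as yoda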

Page 13

Adapting Images

• Two general approaches
  a) Customize the image with the requisite software
  b) Dynamically load the requisite software

• Pro/Con
  – Over-customization may break portability
  – Loading at runtime may not be viable in all cases

Page 14

Image Construction Procedure

• Create a Docker image for Graph500
  – Useful for initial testing on the development machine
  – Docker is not available on the production systems

• Create a Singularity image for Graph500
  – Directly import the Docker image, or
  – Bootstrap the image from a Singularity definition file (sketched below)

• Customize for production
  – Later we found we also had to add a few directories for bind mounts on the production machine
  – Add a few changes to environment variables via the "/environment" file in Singularity 2.2.1
  – Add the 'munge' library for authentication
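A hedged sketch of the bootstrap path mentioned above, using the Singularity 2.x definition-file format (the image size and the %post package list are illustrative):

# Sketch: bootstrap a Singularity 2.x image from the Docker image.
cat > graph500.def <<'EOF'
Bootstrap: docker
From: ubuntu:17.04

%post
    apt-get update && apt-get install -y openmpi-bin libopenmpi-dev
EOF

sudo singularity create --size 2048 graph500-oshmem.img
sudo singularity bootstrap graph500-oshmem.img graph500.def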

Page 15

Cray Titan Image Customizations

• (Required) Directories for runtime bind mounts
  – /opt/cray
  – /var/spool/alps
  – /var/opt/cray
  – /lustre/atlas
  – /lustre/atlas1
  – /lustre/atlas2

• Other customizations for our tests (applied as sketched below)
  – apt-get install libmunge2 munge
  – mkdir -p /ccs/home/$MY_TITAN_USERNAME
  – Edits to the "/environment" file (next slide)
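A hedged sketch of applying these customizations to the image in writable mode (the image name is illustrative; $MY_TITAN_USERNAME stands for the user's Titan login):

# Sketch: apply the Titan customizations listed above inside the image.
sudo singularity exec --writable graph500-oshmem.img /bin/sh -c '
    mkdir -p /opt/cray /var/spool/alps /var/opt/cray \
             /lustre/atlas /lustre/atlas1 /lustre/atlas2
    apt-get update && apt-get install -y libmunge2 munge
    mkdir -p /ccs/home/'"$MY_TITAN_USERNAME"'
'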

Page 16

Singularity /environment

# Appended to the /environment file in the container image

# On Cray, extend LD_LIBRARY_PATH with the host Cray libs
if test -n "$CRAY_LD_LIBRARY_PATH"; then
    export PATH=$PATH:/usr/local/cuda/bin
    # Add Cray-specific library paths
    export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:/opt/cray/sysutils/1.0-1.0502.60492.1.1.gem/lib64:/opt/cray/wlm_detect/1.0-1.0502.64649.2.2.gem/lib64:/usr/local/lib:/lib64:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
fi

# On Cray, also add the host OMPI libs
if test -n "$CRAY_OMPI_LD_LIBRARY_PATH"; then
    # Add OpenMPI/2.0.2 libraries
    export LD_LIBRARY_PATH=$CRAY_OMPI_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
    # Apparently these are needed on Titan
    export OMPI_MCA_mpi_leave_pinned=0
    export OMPI_MCA_mpi_leave_pinned_pipeline=0
fi

Page 17

Setup

• Software
  – Graph500-OSHMEM
    • D'Azevedo 2015 [1]
  – Singularity 2.2.1
  – Ubuntu 17.04
    • Open MPI v2.0.2 with OpenSHMEM enabled (using oshmem: spml/yoda)
    • GCC 5.4.3
  – Cray Linux Environment (CLE) 5.2 (Linux 3.0.x)

• Testbed hardware
  – 4-node, 64-core testbed
  – 10 GigE

• Production hardware
  – Titan @ OLCF
    • 16 cores per node
    • Used 2-64 nodes in tests
  – Gemini network

[1] E. D'Azevedo and N. Imam, Graph 500 in OpenSHMEM, OpenSHMEM Workshop 2015, http://www.csm.ornl.gov/workshops/openshmem2015/documents/talk5_paper_graph500.pdf

Page 18

Test Cases

1. "Native" – a non-container case
   • Establishes a baseline for Graph500 without any container machinery

2. "Singularity" – container set up to use the host network
   • Leverages host communication libraries (Cray Gemini; BTL selection sketched below)
   • Injects host comm. libs into the container via LD_LIBRARY_PATH

3. "Singularity-NoHost" – #2 minus the host libs
   • Runs a standard container (self-contained comm. libs)
   • Unlikely to include high-performance comm. libs, since the containers are built on commodity machines
     – i.e., no Cray Gemini headers/libs in the container
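A hedged sketch of steering the uGNI BTL at launch time with Open MPI MCA parameters (the --mca lines are illustrative, with $IMAGE_FILE and $EXE as defined on the next slide):

# Sketch: select or exclude the Gemini (uGNI) BTL via MCA parameters.
# Case 2 (uGNI): explicitly allow the ugni component.
oshrun -np 128 --map-by ppr:2:node --mca btl ugni,vader,self \
    singularity exec $IMAGE_FILE $EXE 20 16

# "No uGNI" runs: exclude ugni so communication falls back to TCP.
oshrun -np 128 --map-by ppr:2:node --mca btl ^ugni \
    singularity exec $IMAGE_FILE $EXE 20 16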

Page 19

Example Invocations

Singularity (container):

IMAGE_FILE=$PROJ_DIR/naughton/images-singularity/graph500-oshmem.img
EXE=/benchmarks/src/graph500-oshmem/mpi/graph500_shmem_one_sided

oshrun -np 128 --map-by ppr:2:node --bind-to core \
    singularity exec $IMAGE_FILE $EXE 20 16

Native (no container):

EXE=$PROJ_DIR/naughton/graph500/mpi/graph500_shmem_one_sided

oshrun -np 128 --map-by ppr:2:node --bind-to core \
    $EXE 20 16

Page 20

Development Cluster – Graph500 BFS

• Roughly the same performance
  – Native (no container) & Singularity
  – 2 hosts @ scale=20

[Figure: "UB4, 2 Hosts – Graph500 BFS mean_time (scale: 20, edges: 16, skip-validate)"; bar values in seconds, reconstructed below]

  Num. processes (per host)   ub4-native   ub4-singularity
  1  (1 ppr)                  11.9         10.8
  2  (1 ppr)                   9.0          8.6
  4  (2 ppr)                  31.7         29.8
  8  (4 ppr)                  13.5         13.4
  16 (8 ppr)                   5.6          4.9

Page 21

Titan – Graph500 BFS

• Roughly the same performance
  – Native (no container) & Singularity
  – Left: 2 hosts @ scale=16 with the uGNI BTL
  – Right: 16 hosts & 64 hosts @ scale=20 with the uGNI BTL

[Figure, left: "Titan – Graph500 BFS mean_time (scale: 16, edges: 16, skip-validate, uGNI)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   titan-singularity
  1  (2 hosts, 1 ppr)           14.5           14.0
  2  (2 hosts, 1 ppr)           10.2           10.0
  4  (2 hosts, 2 ppr)            8.1            7.8
  8  (2 hosts, 4 ppr)            4.9            4.7
  16 (2 hosts, 8 ppr)            3.1            3.0

[Figure, right: "Titan – Graph500 BFS mean_time (scale: 20, edges: 16, skip-validate, uGNI)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   titan-singularity
  64  (16 hosts, 4 ppr)         17.6           17.3
  64  (64 hosts, 1 ppr)         23.2           23.9
  128 (64 hosts, 2 ppr)          9.0            8.7

Page 22

Titan – Graph Construction & Generation

• Graph construction & generation times are roughly the same
  – Native (no container), Singularity, and Singularity-NoHost
  – 64 hosts @ scale=20 with the uGNI BTL

[Figure, left: "Titan – Graph500 graph_generation (scale: 20, edges: 16, skip-validate, uGNI)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   titan-singularity   titan-singularityNoHost
  64  (64 hosts, 1 ppr)         22.5           22.5                22.5
  128 (64 hosts, 2 ppr)         25.8           25.5                25.4

[Figure, right: "Titan – Graph500 construction_time (scale: 20, edges: 16, skip-validate, uGNI)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   titan-singularity   titan-singularityNoHost
  64  (64 hosts, 1 ppr)         2.6            2.6                 3.0
  128 (64 hosts, 2 ppr)         2.9            2.9                 3.0

Page 23

Titan – Graph500 BFS mean_time

• Much worse performance when using the host library (uGNI)
  – Native (no container), Singularity (with uGNI), Singularity-NoHost
  – 64 hosts @ scale=20 with the uGNI BTL

[Figure: "Titan – Graph500 BFS mean_time (scale: 20, edges: 16, skip-validate, uGNI)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   titan-singularity   titan-singularityNoHost
  64  (64 hosts, 1 ppr)         23.2           23.9                1.1
  128 (64 hosts, 2 ppr)          9.0            8.7                1.1

Page 24

Titan – Native comparison

• OMPI-oshmem uGNI (spml/yoda) performs worse than Cray SHMEM
  – All native (no containers)
  – Suggests something may be wrong with our OMPI-oshmem uGNI setup

[Figure: "Titan – Graph500 BFS mean_time (scale: 20, edges: 16, skip-validate)"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native   crayshmem-native
  64  (64 hosts, 1 ppr)         23.2           6.9
  128 (64 hosts, 2 ppr)          9.0           4.0

Page 25

Titan – Graph500 BFS mean_time

• Roughly the same when the uGNI BTL is disabled
  – Native (no uGNI), Singularity (no uGNI), Singularity-NoHost
  – 64 hosts @ scale=20 without the uGNI BTL

[Figure: "Titan – Graph500 BFS mean_time (scale: 20, edges: 16, skip-validate, TCP / 'no uGNI')"; bar values in seconds, reconstructed below]

  Num. processes (hosts, ppr)   titan-native-NOugni   titan-singularity-NOugni   titan-singularityNoHost-NOugni
  64  (64 hosts, 1 ppr)         1.1                   1.0                        1.1
  128 (64 hosts, 2 ppr)         0.9                   0.9                        1.2

Page 26

Observations (1)

• Graph500
  – Native & Singularity had roughly the same performance
    • Both uGNI & TCP BTLs
  – Singularity-NoHost had better performance in our tests, due to problems with OMPI 2.0.2's OSHMEM over uGNI

• Open MPI's (v2.0.2) OpenSHMEM
  – Good: maybe the only OSHMEM included in a Linux distro
  – Good: the general TCP BTL was stable for testing and showed decent performance in our Graph500 tests at scale=20 on 64 nodes
  – Bad: the Cray Gemini (uGNI) BTL is not stable with the OSHMEM interface (using MCA spml/yoda)

Page 27

Observations (2)

• Singularity in production
  – Performance was consistent between native & Singularity
  – Required customizing the image with directories for bind mounts
  – Inconvenient to push the full image for every edit to the image
    • Example: changing the '/environment' file requires a full re-upload
    • Note: Singularity 2.3 may have improved this, e.g., 'sudo' is not needed to "copy" and the environment can be set via 'SINGULARITYENV_xxx' (sketched below)
  – User-defined bind mounts are disabled on the older CLE kernel
  – Not able to use Cray SHMEM for the container case
    • Note: we cannot create the image on the system, and doing so would defeat the purpose of a portable container (it would not run on the devel cluster)
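A hedged sketch of the Singularity 2.3-style environment injection mentioned in the note above (the variable value and image name are illustrative):

# Sketch: inject environment at run time instead of editing /environment
# inside the image (Singularity 2.3+, no 'sudo' or re-upload needed).
export SINGULARITYENV_LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
singularity exec graph500-oshmem.img env | grep LD_LIBRARY_PATH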

Page 28

Future Work

• Run with a revised OpenSHMEM configuration (sketched below)
  – Use a different component in Open MPI's SPML framework
    • Disable Yoda (spml/yoda)
    • Enable UCX (spml/ucx)
  – Determine the root cause of the unexpected performance with uGNI

• Scale-up tests on the production system
  – Perform larger node/core-count MPI and OSHMEM tests on Titan
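A hedged sketch of the planned SPML switch (assumes an Open MPI build with the UCX SPML compiled in; the invocation mirrors the earlier examples):

# Sketch: steer Open MPI's OpenSHMEM layer from yoda onto UCX.
ompi_info | grep -i spml   # confirm which SPML components are available
oshrun -np 128 --mca spml ucx singularity exec $IMAGE_FILE $EXE 20 16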

Page 29

Summary

• Containers in HPC
  – Motivations
    • Productivity of users
    • Reproducibility of experiments
  – Overview of two key container runtimes
    • Singularity & Docker

• Evaluation on Titan
  – Graph500 benchmark to investigate the performance of an OpenSHMEM application with Singularity-based containers on a production machine
  – Identified a basic set of image edits for use on Titan
  – Consistent performance between Native and Singularity
    • Note: identified unexpected (unexplained) slowness when using the uGNI BTL

Page 30

Acknowledgements

This work was supported by the United States Department of Defense (DoD) and used resources of the Computational Research and Development Programs at Oak Ridge National Laboratory.

Page 31

Questions?