Accounts, Access, User Environment Topics

Blaise Barney
Livermore Computing Development Environment Group
April 19, 2017

LLNL-PRES-729302
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Accounts

▪ LLNL and Collaborators:
— easy - just go to https://lc-idm.llnl.gov as usual
— OCF: add resource Ray (CZ), RZManta (RZ)
— SCF: add resource Shark

▪ LANL and Sandia:
— also easy - go to sarape.sandia.gov as usual
— LLNL resources: Ray, RZManta and Shark (depending on clearance/citizenship)
— Sponsor: Greg Tomaschke, [email protected], 925-423-0561

▪ PSAAP centers:
— go to sarape.sandia.gov as usual
— LLNL resource: Ray
— Sponsor: Blaise Barney, [email protected], 925-422-2578

Allocations

▪ Currently, everyone just gets put into a "guests" account/group.

▪ LC staff also get put into a "lcstaff" account/group.

▪ Eventually, LC will establish a real set of accounts and allocations.

▪ Expect things to behave differently though - the LSF batch system has replaced Moab/SLURM.

▪ TIP: setting this environment variable (for now) will help avoid jobs getting rejected in case you forget to specify a group:

csh/tcsh: setenv LSB_DEFAULT_USERGROUP guests
bash/sh: export LSB_DEFAULT_USERGROUP=guests
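The tip about LSB_DEFAULT_USERGROUP can be sketched as a small fragment for a bash/sh startup file. This is only an illustration: the function name set_default_lsf_group is hypothetical, and the fragment assumes you want to keep a group that has already been set elsewhere rather than overwrite it.

```shell
#!/bin/sh
# Sketch for a bash/sh startup file (csh/tcsh users would use setenv instead).
# set_default_lsf_group is a hypothetical helper name, not an LC or LSF tool.
set_default_lsf_group() {
    # $1: group to fall back to when LSB_DEFAULT_USERGROUP is unset or empty
    : "${LSB_DEFAULT_USERGROUP:=$1}"
    export LSB_DEFAULT_USERGROUP
}

# "guests" is the group named in the tip above.
set_default_lsf_group guests
echo "LSF default user group: $LSB_DEFAULT_USERGROUP"
```

Because the helper only assigns when the variable is unset or empty, a value such as lcstaff that was exported earlier is left alone.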

Access

▪ Ray (CZ):
— Accessible directly from within the LLNL domain
— Not currently accessible directly from outside LLNL
• Need to log in to another CZ machine first, then ssh to Ray
• This will change later, after required security measures are in place

▪ RZManta (RZ):
— Accessible only through rzgw.llnl.gov - same as other RZ systems
— LANL/Sandia: need to start from an "ihpc" node. Instructions are at: https://hpc.llnl.gov/manuals/access-lc-systems/logging

▪ Shark (SCF):
— Accessible directly from anywhere within the SCF
— LANL/Sandia: Kerberos authentication - same as other SCF machines. No password/token required.
— Note: as of today's presentation, Shark is not yet available
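For the "log in to another CZ machine first, then ssh to Ray" case, one possible shortcut is OpenSSH's -J (jump host) option, available in OpenSSH 7.3 and later. The sketch below only builds and prints the command; the user and host names are placeholders rather than real LC host names, and whether LC's authentication setup permits this one-step form is an assumption that is not confirmed by the slides.

```shell
#!/bin/sh
# Build (but do not run) an ssh command that hops through a CZ machine to Ray.
# ray_via_cz_cmd is a hypothetical helper; all host names are placeholders.
ray_via_cz_cmd() {
    # $1: user name, $2: CZ jump host, $3: Ray login host
    printf 'ssh -J %s@%s %s@%s\n' "$1" "$2" "$1" "$3"
}

# Example with placeholder names; substitute your own user and hosts.
ray_via_cz_cmd myuser cz-host.example ray.example
```

If the one-step form is not permitted, the two-step approach from the slide (ssh to a CZ machine, then ssh to Ray from there) still applies.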

User Environment Topics

▪ Expectations
▪ Big differences
▪ Like TOSS 3 (kinda)
▪ Beta software environment
▪ File systems
▪ Modules, dotkits
▪ Compilers (covered later)
▪ MPI (covered later)
▪ Running jobs & LSF batch system (covered later)
▪ Software
▪ Math libraries
▪ HPSS Storage, FIS
▪ Miscellaneous
▪ Documentation and getting help

Setting Expectations

▪ Although Power8 and Pascal hardware are not brand new, putting them together is, and that is especially true for the software.

▪ There will be a "learning curve" for all involved: vendors, LC staff and users alike.

▪ Much of the software is "beta" level, and some is still being developed as we speak.

▪ Expect some growing pains: unplanned outages, planned outages, bugs, reboots, changes (some with little notice), instabilities, performance issues, etc.

▪ What you might typically expect from new systems...and more!

▪ LC is interested in your feedback - we're in this together!

Big Differences

Feature | Typical LC Linux Cluster | CORAL EA Cluster
Hardware | Intel Xeon | IBM Power8 + NVIDIA Pascal
Multi-threading (CPU) | 2 hardware threads per core | 8 hardware threads per core
Peak flops (% GPU) | 0% | 97%
Job scheduler | SLURM / Moab | IBM Spectrum LSF
Parallel file systems | Lustre | IBM Spectrum Scale (GPFS)
Compilers | Intel, GNU, PGI, Clang | IBM XL, Clang (GNU, PGI, xlflang)
MPI | MVAPICH, Open MPI, Intel | IBM Spectrum MPI
Packages | dotkit, Tcl modules | Lmod modules
NVRAM SSD | No | Yes (Ray only)
Job launcher | srun | mpirun (jsrun beta coming soon)

TOSS 3 Like Environment (kinda)

▪ Same OS - Red Hat Enterprise Linux Server release 7.3

▪ /usr/tce/ will be used instead of /usr/local for compilers, MPI, tools, packages, etc.
— Currently /usr/tcetmp is being used but that will transition to /usr/tce later

▪ Lmod modules are used to load software environments

▪ However, CORAL EA systems do not run true TOSS 3 software:

ray23% echo $SYS_TYPE
blueos_3_ppc64le_ib
ray23% distro_version
blueos 3.0-0

quartz2306% echo $SYS_TYPE
toss_3_x86_64_ib
quartz2306% distro_version
toss 3.0-2.1
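Since $SYS_TYPE differs between the two kinds of systems, a shared startup file or build script can branch on it. The following is a minimal sketch: the function name platform_family and the labels it prints are made up for illustration, and only the two SYS_TYPE values shown above are taken from the slides.

```shell
#!/bin/sh
# Classify a SYS_TYPE string. platform_family and its output labels are
# hypothetical; the blueos_/toss_ prefixes come from the SYS_TYPE values above.
platform_family() {
    case "$1" in
        blueos_*) echo "coral-ea" ;;   # e.g. blueos_3_ppc64le_ib on ray
        toss_*)   echo "toss" ;;       # e.g. toss_3_x86_64_ib on quartz
        *)        echo "unknown" ;;
    esac
}

# Report the family of the machine this runs on (empty SYS_TYPE -> unknown).
platform_family "${SYS_TYPE:-}"
```

A script could use the result to pick, for example, a per-architecture build directory, so that ppc64le and x86_64 binaries are kept apart in a shared home directory.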

Beta Software Environment

▪ Much of the software on LC's CORAL EA systems is beta release.

▪ So, until GA-level software is installed, any performance results from applications/benchmarks running on these systems are not publishable without official review from IBM and/or NVIDIA.
— Questions? Contact Rob Neely ([email protected]) or Bronis de Supinski ([email protected])

▪ The environment is changing rapidly, possibly with little notice.

▪ clang-coral and xl beta compiler releases arrive every few weeks

▪ More beta software on the way:
— Job launcher beta (jsrun) - will replace mpirun.
— Burst buffers
— Cluster Systems Manager (CSM)

File Systems

CORAL EA systems mount the usual LC file systems. The only significant difference is that these systems use IBM's Spectrum Scale product for parallel file systems instead of Lustre. Available file systems are summarized in the table below.

File System | Mount Points | Backed Up? | Purged? | Comments
Home directories | /g/g0 - g99 | Yes | No | 16 GB quota; safest file system; includes .snapshot online backups
Workspace | /usr/workspace/ws* | No | No | 1 TB quota; includes .snapshot online backups
Local tmp | /tmp, /usr/tmp, /var/tmp | No | Yes | Node local temporary file space; small; actually resides in node memory, not physical disk
NFS tmp | /nfs/tmp2 | No | Yes | Large NFS mounted temporary file space; shared by all users and multiple clusters
Collaboration | /usr/gapps, gdata; /collab/usr/gapps, gdata | Yes | No | User managed application directories; intended for collaborative development and usage
Parallel | /p/gscratchr (ray); /p/gscratchrzm (rzmanta); /p/gscratch# (shark - TBD) | No | Yes | Intended for parallel I/O; large, shared by all users on a cluster

File Systems - IBM Spectrum Scale /p/gscratch*

▪ Sizes:
— Ray (/p/gscratchr): 1.3 PB
— RZManta (/p/gscratchrzm): 431 TB
— Shark (/p/gscratch#): TBA

▪ We expect that, from a user perspective, application interactions with this new parallel file system will be similar to Lustre.

▪ We also expect to learn about differences as we acquire experience.

▪ oslic, rzslic, cslic clusters will eventually mount the respective gscratch file system for convenience.

▪ IBM Spectrum Scale product information is available at: http://www-03.ibm.com/systems/storage/spectrum/scale/index.html
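Because each cluster has its own parallel file system mount point, job scripts that run on more than one cluster may want to look the path up rather than hard-code it. Here is a rough sketch, assuming node names begin with the cluster name (as in ray23); the function name scratch_dir_for is hypothetical, and Shark is left out because its path is still to be determined.

```shell
#!/bin/sh
# Map a host name to its parallel file system mount point.
# scratch_dir_for is a hypothetical helper; the paths are those listed above.
scratch_dir_for() {
    case "$1" in
        ray*)     echo "/p/gscratchr" ;;
        rzmanta*) echo "/p/gscratchrzm" ;;
        *)        return 1 ;;   # shark (path TBD) and unknown hosts
    esac
}

# Look up the mount point for the current host, if it is a known cluster.
host=$(hostname -s 2>/dev/null || hostname)
if dir=$(scratch_dir_for "$host"); then
    echo "Parallel file system for $host: $dir"
else
    echo "No known /p/gscratch* mount point for $host"
fi
```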

Modules, Dotkits

▪ As with TOSS 3 systems, Lmod modules are used for most software packages, such as compilers, MPI and tools.
— Dotkits have pretty much disappeared
— Users only need to know a few commands to effectively use modules
— The "ml" shorthand can be used instead of "module" - for example: "ml avail"

Command | Description
module avail | List available modules
module load package | Load a selected module
module list | Show modules currently loaded
module unload package | Unload a previously loaded module
module purge | Unload all loaded modules
module reset | Reset loaded modules to system defaults
module display package | Display the contents of a selected module
module spider | List all modules (not just available ones)
module keyword key | Search for available modules by keyword
module, module help | Get help

More on Modules

▪ Simple example - see what's loaded by default, see what's available, load a selected module, check again:

ray23% module list
Currently Loaded Modules:
  1) xl/beta-2017.04.11   2) spectrum-mpi/2017.04.03   3) StdEnv

ray23% module avail
---------- /usr/tcetmp/modulefiles/Compiler/xl/beta-2017.04.11 ------
   spectrum-mpi/2017.04.03 (L)

----------------------- /usr/tcetmp/modulefiles/Core -------------------------
   StdEnv (L)                    makedepend/1.0.5
   clang/coral-2017.03.15        pgi/16.10
   clang/coral-2017.03.29 (D)    pgi/17.1
   clang/3.9.1                   pgi/17.3 (D)
   cmake/3.7.2                   totalview/2016.07.22
   gcc/4.8-redhat                totalview/2017.0.12 (D)
   gcc/4.9.3 (D)                 xl/beta-2017.03.28
   git/2.9.3                     xl/beta-2017.04.11 (L,D)
   gmake/4.2.1                   xl/2016.12.02
   gsl/2.3

---------------- /usr/share/lmod/lmod/modulefiles/Core -------------------
   lmod/6.5    settarg/6.5

  Where:
   L:  Module is loaded
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

ray23% module load clang/coral-2017.03.15
Lmod is automatically replacing "xl/beta-2017.03.28" with "clang/coral-2017.03.15"

Due to MODULEPATH changes the following have been reloaded:
  1) spectrum-mpi/2017.04.03

ray23% module list
Currently Loaded Modules:
  1) StdEnv   2) clang/coral-2017.03.15   3) spectrum-mpi/2017.04.03

More on Modules

▪ CORAL EA systems pre-load certain modules into your environment
— Important to know for selecting your choice of compiler. For example:

ray23% module list
Currently Loaded Modules:
  1) xl/beta-2017.04.11   2) spectrum-mpi/2017.04.03   3) StdEnv

▪ LC employs module hierarchies for some packages:
— loading module A will cause modules B and C to become available

▪ Module families are also used:
— only one package from each family may be loaded at once
— if compiler A is loaded, and then compiler B, compiler A will be unloaded

▪ A number of modules have default versions - designated by a (D) next to the module name. For example:

totalview/2016.07.22
totalview/2017.0.12 (D)

"module load totalview" will select the (D) version
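Because compilers belong to one module family, switching compilers takes a single "module load"; Lmod unloads the previously loaded compiler itself. The sketch below wraps that in a hypothetical helper, switch_compiler, and only acts when a module command is actually defined in the shell. The module name is one of those listed by "module avail" on ray at the time of the presentation and may no longer exist.

```shell
#!/bin/sh
# Load a compiler module, then show what is loaded.
# switch_compiler is a hypothetical helper, not an Lmod or LC command.
switch_compiler() {
    # $1: compiler module name, e.g. clang/coral-2017.03.29
    module load "$1" || return 1
    module list
}

# Only act where a module command (e.g. Lmod's shell function) is defined.
if command -v module >/dev/null 2>&1; then
    switch_compiler clang/coral-2017.03.29 || echo "could not load module"
else
    echo "module command not found; skipping"
fi
```

Passing a name without a version, such as totalview, relies on the (D) default described above.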

More on Modules

▪ More on Lmod modules:
https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
http://lmod.readthedocs.io/en/latest/index.html

▪ LC documentation:
https://lc.llnl.gov/confluence/display/TCE/Using+TOSS+3#UsingTOSS3-Modules

Software

▪ The most important software (compilers, MPI and tools) will be covered in detail later.

▪ CUDA 8.0 - installed under /usr/local - with links in /usr/tcetmp/packages for convenience. More info about NVIDIA software will be covered later.

▪ A small assortment of other software/utilities can be found in /usr/tcetmp/bin, /usr/tcetmp/packages or via "module spider".

▪ Visualization software (list is at https://hpc.llnl.gov/data-vis/vis-software): a subset of these packages will be ported to CORAL EA. Currently under evaluation.

▪ Software under /usr/gapps is owned and maintained by users - porting to CORAL EA systems will vary.

▪ Need something that's missing? Let us know (LC Hotline)...

Software - Math Libraries

▪ MASS - Mathematical Acceleration Subsystem Libraries

— From IBM: a set of C/C++ libraries of tuned mathematical intrinsic functions (scalar, vector, SIMD) that provide improved performance over the corresponding standard system math library functions.

▪ ESSL - Engineering and Scientific Subroutine Library

— IBM's ESSL is a collection of high-performance subroutines providing a wide range of mathematical functions for many different scientific and engineering applications. A subset of the functions contained in ESSL are tuned replacements for some of the functions provided in the BLAS and LAPACK libraries. C/C++ and Fortran.

▪ Installed under /usr/tcetmp/packages for convenience:

— NETLIB: BLAS, LAPACK, ScaLAPACK

— FFTW

— GSL - GNU Scientific Library - over 1000 functions; C/C++

▪ Documentation: https://lc.llnl.gov/confluence/display/CORALEA/Math+Libraries

▪ Other math software (MATLAB, Mathematica) - no plans to port at this time

HPSS Storage, FIS

▪ CORAL EA systems do not currently have access to OCF/SCF HPSS storage

— Awaiting some infrastructure work - stay tuned

▪ FIS (File Interchange Service) - for moving files between OCF and SCF

— Available on CORAL EA systems

Miscellaneous

▪ Login nodes vs. compute nodes:

— As with other LC systems, login nodes are shared by multiple users and should not be used for compute intensive or parallel work.

— CORAL EA login nodes do NOT have GPUs - only the compute nodes do

▪ Spack - package management tool designed to support multiple versions and configurations of software on a wide variety of platforms and environments. For details see: http://spack.readthedocs.io/

▪ X-Win32 2012 users: if things aren't working right, you will probably need to get a more recent build (at least build 102) installed. Contact 4-HELP or your local desktop support person.

— Using LANDesk to download a more recent version may not work because it doesn't de-install the old version.

Documentation and Getting Help

▪ BEST place to get started for user information is the CORAL EA Systems confluence wiki page:

https://lc.llnl.gov/confluence/display/CORALEA/CORAL+EA+Systems

or just go to the LC confluence wiki https://lc.llnl.gov/confluence and search for "coral"

Documentation and Getting Help

▪ Best place to get started for user information: https://lc.llnl.gov/confluence/display/CORALEA/CORAL+EA+Systems

— Includes information about IBM & NVIDIA hardware, compilers, MPI, running jobs & LSF batch system, tools, math libraries, user environment topics, quickstart guide + more

— Lots of links to in-depth information on related topics

— Also includes a general discussion blog and "symptoms & solutions"

• Your contributions are welcome!

• Questions or problems? Check here first to see if there's already an answer/solution

▪ Reporting problems, questions and getting help:

— The LC Hotline is available as the "front line" of support to CORAL EA systems - as with other LC systems: [email protected] (925) 422-4531

— Referrals to other LC staff, IBM and NVIDIA onsite reps

Documentation and Getting Help

▪ On-site IBM and NVIDIA support:

— To help address the challenges of adapting to new technologies delivered in the CORAL systems, IBM and NVIDIA provide dedicated full-time, on-site support for the duration of the CORAL contract. This support helps facilitate efficient interaction with IBM and NVIDIA technical engineering to resolve issues with the systems and software.

• System Administrator and Spectrum Scale (GPFS) Subject Matter Expert (James Lamb, IBM)

– Hardware and system software.

• NVIDIA Solutions Architect (Max Katz, NVIDIA)

– All things related to the use of the NVIDIA GPUs

• Application Analyst (Roy Musselman, IBM)

– Compilers, MPI, math libraries, IBM tools

— These experts are highly integrated into the Livermore Computing (LC) support teams: Development Environment Group (DEG) and System Administration Group (SAG), and are a supplemental part of the total support structure which includes the LC Hotline.

Documentation and Getting Help

▪ Sierra Center of Excellence (COE):

— Provides resources for ensuring application readiness for Sierra

— Includes development and vendor support from IBM and NVIDIA; Funded by ASC

— https://lc.llnl.gov/confluence/display/SCOE/Sierra+Center+of+Excellence+Home

• Excellent resource for information on Sierra related topics, workshops, presentations, etc.

• However, access is restricted due to NDA material

▪ Institutional Center of Excellence (iCOE) Project

— Complementary to the Sierra COE, but for institutional (M&IC) programs

— https://lc.llnl.gov/confluence/display/SCOE/Institutional+Center+of+Excellence+%28iCOE%29+Project

▪ Advanced Architecture and Portability Specialists (AAPS) team

— https://lc.llnl.gov/confluence/display/AAPS/Advanced+Architecture+and+Portability+Specialists

• Source of knowledge dissemination for work done with specific ASC / Tri-lab codes

▪ Sierra COE and iCOE projects: Talk to Rob Neely (COE) or Bert Still / Ian Karlin (iCOE) if you have any questions/issues.
