Guillimin HPC Users Meeting - November 2016
November 10, 2016
guillimin@calculquebec.ca
McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada

• Please be kind to your fellow user meeting attendees
• Please limit yourself to two slices of pizza per person to start
• And please recycle your pop cans
• Thank you!

Outline

• Compute Canada News
• System Status
• Software Updates
• Training News
• Special Topic
  • CernVM File System (CVMFS)

Compute Canada News

• (Reminder) 2017 Resource Allocation Competitions
• More information here: https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/
• Competition Information Sessions (slides): https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/#2017

Process                                        Opens              Due
Fast Track application (by invitation only)    Early October      Nov 9, 2016
RRG and RPP full application                   Early October      Nov 24, 2016
Announcement & implementation of awards        Early March 2017   Mid April 2017

Compute Canada News

• When should you request Resources for Research Groups (RRG) on Guillimin?
  • More than 2 times the default CPU allocation:
    • RAS: default priority level: 30 core*years
    • Resource name: Guillimin - phase 1 or phase 2
  • More than 1 GPU*year or 1 MIC*year
    • RAS: small fraction of GPU or MIC
  • More than 5 times the default storage allocation:
    • RAS: default project space allocation: 1 TB (up to 5 TB on demand, but not guaranteed)
    • Resource name: DataSTAR, total 4 PB for RAC 2017
  • Storage space on tape (archive or backup)
    • RAS: only home directories are saved in the backup system
    • Resource name: DataSTAR, Guillimin - phase 2
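The RRG thresholds on this slide can be read as a simple checklist. The sketch below is only an illustration of that arithmetic: the function name, argument names and returned labels are hypothetical, and only the numbers (30 core*years, 1 GPU*year or MIC*year, 1 TB, the 2x and 5x factors) come from the slide.

```python
# Defaults quoted on the slide (RAS = default allocation).
DEFAULT_CPU_CORE_YEARS = 30   # default priority level
DEFAULT_STORAGE_TB = 1        # default project space


def rrg_reasons(cpu_core_years=0.0, gpu_years=0.0, mic_years=0.0,
                storage_tb=0.0, needs_tape=False):
    """Return the slide's criteria that a planned usage exceeds (hypothetical helper)."""
    reasons = []
    if cpu_core_years > 2 * DEFAULT_CPU_CORE_YEARS:   # more than 60 core*years
        reasons.append("cpu")
    if gpu_years > 1 or mic_years > 1:                # more than 1 GPU*year or MIC*year
        reasons.append("accelerator")
    if storage_tb > 5 * DEFAULT_STORAGE_TB:           # more than 5 TB
        reasons.append("storage")
    if needs_tape:                                    # archive or backup on tape
        reasons.append("tape")
    return reasons
```

An empty list means the default allocation should be enough under this reading; any entry suggests an RRG request is worth considering.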

System Status

• November 1: downtime for scheduled site-wide power outage
  • Clean shutdown of VMs on GPFS, nodes and GPFS
  • Maintenance of the UPS system
  • Clean shutdown of remaining VMs and services
• November 2: brought services back online
  • Tested: GPFS, job scheduling, nodes
  • Tried to update but rolled back: Matlab license
  • Reopened access to: login nodes, scheduler
• November 4: fixed LDAP issue (account creation)
  • Increased the number of lock files
• November 8: InfiniBand issue - GPFS hang
  • Needed to restart the IB fabric and GPFS

Storage Status

• Space management
  • /gs is full: 99% used, 51 TB free (as of Nov. 7)
• For better space management we continue to migrate cold data from disk to tape
  • Metadata remains on disk
  • Users can still access their files through the usual methods, but with increased latency
• Storage space is a precious resource - manage it wisely!
  • Delete temporary files, compress large files that are not frequently accessed, tar many smaller files into collections, …
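As one way to follow the advice about tarring many small files, here is a minimal Python sketch using the standard tarfile module. The function name and the directory layout are made up for illustration; nothing here is specific to Guillimin or /gs.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, archive_path):
    """Pack everything under src_dir into one gzip-compressed tar archive.

    Returns the sorted names of the regular files stored in the archive,
    so the result can be checked before the originals are deleted.
    """
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src, arcname=src.name)   # recurses into subdirectories
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Checking the returned file list against the source directory before removing the originals is a cheap safeguard; a single archive also means one object instead of many on disk or tape.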

New Software Installations

• Maven/3.3.9 - Apache Maven: build manager for Java projects

Training News

• All upcoming events: calculquebec.eventbrite.ca
  • Nov. 23 - Programmation en R intermédiaire [Intermediate R programming] (U. Montreal)
• Recently completed:
  • Oct. 27 - Software Carpentry (U. Montreal)
  • Oct. 27 - Introduction to Intel Xeon Phi (McGill U.)
  • Nov. 8 - Easy GPU programming with OpenACC (U. Laval)
  • Nov. 9 - Introduction aux serveurs de calcul [Introduction to compute servers] (U. Sherb.)
• All materials from previous workshops are available online: wiki.calculquebec.ca/w/Formations/en
• All user meeting presentations are online at www.hpc.mcgill.ca

User Feedback and Discussion

• Questions? Comments?
• We value your feedback. Contact us at: guillimin@calculquebec.ca
• Guillimin Operational News for Users
  – Status pages
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  – Follow us on Twitter
    • http://twitter.com/McGillHPC

CernVM File System (CVMFS)
November 10, 2016
guillimin@calculquebec.ca
McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada

Outline

• What is CVMFS
• How it works
  • Structure, Technology and Workflow
• CVMFS in Compute Canada Projects
  • GenAP, MUGQIC, SoftCC
• Outlook and Support
• Conclusion

What is CVMFS?

• CVMFS: CERN Virtual Machine File System (CernVM-FS)
• Designed to deliver software in a fast, scalable, reliable and distributed way
• A file system hosted on standard web servers and mounted by clients in a universal namespace (/cvmfs) as a read-only POSIX file system in user space (a FUSE module)
• Software can be installed in one location and cached on demand anywhere using CVMFS technology
• Through aggressive caching and reduction of latency, CernVM-FS focuses specifically on the software use case (small files)
• Recent development is extending it to data files (large files)
• Originally developed for the LHC (Large Hadron Collider) experiments to optimally deliver software for VM images and as a replacement for the separate software installation areas at many distributed locations (>200 HPC sites)

CVMFS Structure example, at the LHC

[Figure: CVMFS structure example at the LHC]

CVMFS Technology

• The FUSE kernel module is used
• A virtual file system that loads data only on access
• All data/software is hosted on a CernVM-FS repository (the stratum-0)

Opening a file on CernVM-FS (client side)

• Name resolution via an SQLite catalog
• File downloads are verified against the cryptographic hash of the corresponding catalog entry
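The two steps above (look the path up in a catalog, then check the downloaded bytes against the catalog's hash) can be sketched in a few lines of Python. This is a toy model, not the CernVM-FS client: the catalog is a plain dictionary rather than SQLite, SHA-1 is chosen here only as an example hash, and the path, `catalog` and `open_verified` are hypothetical names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash used to identify file content (SHA-1 picked for illustration)."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical catalog: file path -> hash of the expected content.
catalog = {
    "/cvmfs/example.repo/bin/tool": content_hash(b"#!/bin/sh\necho tool\n"),
}


def open_verified(path, download):
    """Resolve path in the catalog, download by hash, and verify the result.

    download is any callable taking the expected hash and returning bytes.
    """
    expected = catalog[path]          # name resolution
    data = download(expected)         # fetch the object
    if content_hash(data) != expected:
        raise ValueError(f"hash mismatch for {path}")
    return data
```

Because the check uses only the catalog entry, a corrupted or tampered download is rejected no matter which server or cache supplied it.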

CVMFS distribution Workflow

Library node -> Stratums -> Clients

• If the file is not in the local cache, it is fetched from the squid; if the squid does not have it either, the squid fetches it from a stratum-1 (a stratum-1 is always a replica of the stratum-0)
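The lookup chain described here (local cache, then squid, then stratum-1) is a standard tiered cache. The following Python sketch models it with three small classes; the class names and the request counters are illustrative assumptions, and no networking is involved.

```python
class Stratum1:
    """Stands in for a public mirror holding a full replica of the repository."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.requests = 0

    def get(self, key):
        self.requests += 1
        return self.objects[key]


class Squid:
    """Site-level caching proxy: asks the mirror only on a cache miss."""

    def __init__(self, upstream):
        self.cache = {}
        self.upstream = upstream
        self.requests = 0

    def get(self, key):
        self.requests += 1
        if key not in self.cache:
            self.cache[key] = self.upstream.get(key)
        return self.cache[key]


class Client:
    """Node-level client: asks the squid only when its local cache misses."""

    def __init__(self, proxy):
        self.local_cache = {}
        self.proxy = proxy

    def read(self, key):
        if key not in self.local_cache:
            self.local_cache[key] = self.proxy.get(key)
        return self.local_cache[key]
```

With two clients behind one squid, the first read of an object reaches the stratum-1 once; the second client is served by the squid, and repeat reads never leave the node.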

The Repository (Librarian) Node

• A node on which the developer or librarian interacts with the repository in order to publish files (software or data files)
  • Protection mechanisms exist to protect publishing integrity, e.g. one librarian at a time (a single librarian account)
• Files are published to the stratum-0 via a publish command
• Changes to be published are tracked by a union file system that combines a CernVM-FS read-only mount point and a writable scratch area

Example of a Librarian Interaction

• A sudo librarian account is used to manage software injection into CVMFS for Compute Canada projects
  • Commands: cvmfs_server transaction, cvmfs_server publish, cvmfs_server abort
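To show what the three commands mean for the repository state, here is a toy Python model of a transaction over a read-only published tree plus a writable scratch area. It does not call cvmfs_server and is not how the tool is implemented; `ToyRepository` and its methods are hypothetical names that only mirror the command names on the slide.

```python
class ToyRepository:
    """In-memory stand-in for a repository with a single open transaction."""

    def __init__(self):
        self.published = {}      # read-only view seen by clients
        self._scratch = None     # writable area; None means no open transaction

    def transaction(self):
        # Only one librarian session at a time.
        if self._scratch is not None:
            raise RuntimeError("a transaction is already open")
        self._scratch = {}

    def write(self, path, data):
        if self._scratch is None:
            raise RuntimeError("open a transaction first")
        self._scratch[path] = data

    def read(self, path):
        # Union view: scratch entries shadow published ones.
        if self._scratch is not None and path in self._scratch:
            return self._scratch[path]
        return self.published[path]

    def publish(self):
        if self._scratch is None:
            raise RuntimeError("no open transaction")
        self.published.update(self._scratch)
        self._scratch = None

    def abort(self):
        # Discard every change made since transaction().
        self._scratch = None
```

Nothing written during a transaction is visible in `published` until `publish()` runs, and `abort()` returns the repository to its last published state.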

Stratum0, Stratum1

• Stratum0: a protected read/write instance of the files
• It feeds the public, distributed mirror web servers (the stratum1s)
• A distributed hierarchy of proxy servers (squids) fetches content from the closest public mirror server
• Projects can share a Stratum0, or each project can have its own, perhaps depending on repository sizes and purposes
  • Example in Compute Canada: we have one for MUGQIC and for Compute Canada Software (softcc)

Squid Caches

• Squid: a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more
• It reduces bandwidth and improves response times by caching and reusing frequently requested web files. Squid has extensive access controls and makes a great server accelerator.
• Depending on cluster size, a site can employ two or more squids and also define more than one squid for failover

CVMFS Clients

• CernVM-FS is controlled by autofs and automount
• Base directory is "/cvmfs"
• CVMFS clients can be:
  • Compute nodes (bare metal)
  • VMs
  • Containers
• Can also deliver files to cloud S3 space
• Clients can use a local cache or an alien cache

"Client end" Cache types

• Local cache: local to the node
• Alien cache: files on a shared file system, in a cache outside the control of CernVM-FS
• NFS server mode: exporting a single CernVM-FS client via NFS to compute nodes

[Figure: example of a compute node with a local cache of software at Guillimin]

Applications (1)

• Software distribution:
  • Delivers software to clusters
  • Clusters can load only the relevant software and versions
  • Software can be organised into a software tree designed to accommodate different OS types, versions, architectures and modularization
• Data federation (new!): StashCache
  • A cooperating set of storage resources transparently accessible across a wide area network via a common namespace

StashCache Example

• Workflows for up to 10 TB of data can be achieved
• 500 MB/job of data has been delivered using StashCache
• A good application for remote data access

Advantages of CVMFS for Software (1)

1. Centralised software maintenance
   a. Single-point injection and propagation to multiple sites/clients
   b. Software versioning across sites is reinforced
   c. Software mounts on VMs, containers and compute nodes (bare metal)
   d. Software prerequisites can be installed to meet site compatibilities
   e. Low-maintenance, high-scale delivery
2. Files are fetched from the server using the standard HTTP protocol
3. Files are cached on demand in order to reduce local network traffic
   a. Tuning is possible to give a longer cache life
4. The CVMFS structure is highly scalable and redundant
5. Failover is achieved using local and remote squids
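Point 5 (failover between local and remote squids) amounts to trying proxies in order until one answers. The sketch below shows that idea only; `fetch_with_failover` is a hypothetical helper, each proxy is modelled as a plain callable, and real clients handle this through their own proxy configuration rather than code like this.

```python
def fetch_with_failover(key, proxies):
    """Try each proxy in order and return the first successful response.

    proxies is a list of callables taking a key and returning bytes;
    a proxy signals that it is unreachable by raising ConnectionError.
    """
    errors = []
    for proxy in proxies:
        try:
            return proxy(key)
        except ConnectionError as exc:
            errors.append(exc)       # remember the failure, try the next one
    raise ConnectionError(
        f"all {len(proxies)} proxies failed for {key!r}: {errors}"
    )
```

Listing the local squid first and a remote one second keeps traffic on site in the normal case while still serving jobs when the local proxy is down.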

Applications Within Compute Canada

• GenAP - Genetics and Genomics Analysis Platform: https://genap.ca/public/home
  • Currently over 100 bioinformatics tools, packages and pipelines transparently distributed
• Compute Canada Software
  • Presently each cluster has its own separate software module system
  • CVMFS technology will be used in the new systems
• Other projects:
  • https://bitbucket.org/mugqic/mugqic_pipelines
  • Also sub-atomic physics projects (traditional users)

Present GenAP, MUGQIC CVMFS Structure in CC

[Figure: labels include "GP2 (uvic)", "Also Briaree", "West", "East"]

GenAP CVMFS deploying software to Clusters

[Figure: GenAP CVMFS. Source: http://cggony.wixsite.com/genap-v1]

SoftCC CVMFS

• SoftCC-CVMFS: software for Compute Canada sites
• Software distribution targeted at the new sites
• Under test and piloting
• The softcc CVMFS repository will be deployed on GP2 and GP3, and made accessible to new and other CC systems, providing a standard set of software that will be identical at all locations

Support

• Main development is done by CERN
• In Compute Canada, adaptation, deployment and minimal developmental tuning to suit CC projects are done by the CVMFS group
• Infrastructure to deploy global CVMFS to the new sites is being deployed
• CC support email: [email protected]

End user experience (demo on a login node)

• The cvmfs_config command is available to check cvmfs

Conclusion

• CernVM-FS technology is very useful for software distribution and, recently, for data
  • Low maintenance but highly scalable: it can serve hundreds of clusters with only one software injection point
• Beyond sub-atomic physics projects, two bioinformatics CC projects have adopted it, and it is to be implemented at all Compute Canada sites as the standard way to distribute software

Questions?

EXTRA SLIDES

CVMFS Key Features, desirable for software delivery

• Use of the FUSE kernel module, which comes with in-kernel caching of file data and file attributes
• Cache quota management
• Use of a content-addressable storage format, resulting in immutable files and automatic file de-duplication
• Possibility to split a directory hierarchy into sub-catalogs at user-defined levels
• Automatic updates of file catalogs, controlled by a time to live stored inside the file catalogs
• Digitally signed repositories
• Transparent file compression/decompression and transparent file chunking
• Capability to work in offline mode, provided that all required files are cached
• File system versioning and hotpatching
• Dynamic expansion of environment variables embedded in symbolic links
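Content-addressable storage is what makes files immutable and de-duplicated: an object is stored under the hash of its bytes, so identical content maps to one object however many paths point to it. The Python sketch below is a minimal illustration under that assumption; `ContentStore` is a hypothetical class, SHA-1 is used only as an example hash, and compression, chunking and signing are left out.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store with a path catalog on top."""

    def __init__(self):
        self.objects = {}    # content hash -> bytes (immutable once stored)
        self.catalog = {}    # file path -> content hash

    def add(self, path, data):
        key = hashlib.sha1(data).hexdigest()
        self.objects.setdefault(key, data)   # identical content is stored once
        self.catalog[path] = key
        return key

    def read(self, path):
        return self.objects[self.catalog[path]]
```

Adding the same library under two software versions therefore costs one stored object and two catalog entries, which is why sharing files between versions is cheap in this kind of layout.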