Helping Advance Research Computing in Canada
Lixin Liu
Simon Fraser University
October 15, 2019
Outline
• Compute Canada and Canadian ARC Funding
• Cedar System and Hardware
• Lustre File System
• Cedar Provisioning & Management
• Software Application Delivery
• Resource Allocation and Scheduler
• Compute Canada Support Model
2
Compute Canada and Canadian ARC Funding
Compute Canada
• Non-profit organization supporting research at Canadian public institutions
• 4 Regional consortia: ACENET, Calcul Quebec, Compute Ontario, WestGrid
• 37 member institutions, all public universities
• More than 200 analysts and systems administrators
• Operates 5 large national ARC sites: Arbutus, Cedar, Graham, Niagara, Beluga
• Participates in many national and international collaborations, e.g., ATLAS
ARC Funding in Canada
• The federal government provides the majority of ARC funding through CFI/ISED
– CFI: Canada Foundation for Innovation, National Platforms Fund
– ISED: Innovation, Science & Economic Development Canada, Cyber Infrastructure Fund
• Provincial matching funds
• Vendor in-kind matching funds
3
Compute Canada Clusters
• CC issued an RFP call to host the national ARC systems
• 5 hosting institutions were selected after review:
– Arbutus (GP1): Victoria, OpenStack cloud
– Cedar (GP2): SFU, general purpose cluster with GPUs
– Graham (GP3): Waterloo, general purpose cluster with GPUs
– Niagara (LP): Toronto, large parallel jobs only
– Beluga (GP4): McGill, general purpose cluster with GPUs
• CC & Hosting institutions issued RFPs to purchase
– Systems at 4 Stage-1 sites (GP1-3 and LP) and 1 Stage-2 site (GP4)
– National Data Cyberinfrastructure (long term storage at all stage-1 sites)
– WAN network equipment (100GE to National RE network at stage-1 sites)
– Scheduler, parallel filesystem for all clusters
4
SFU Data Centre
• Location – Water Tower Building
– SFU campus on Burnaby Mountain, 15 km from downtown Vancouver
– Built in 1969 by BC Hydro as the Control Centre, covering 90% of the BC population
– 9,000+ sq ft of ground-floor space on a concrete slab
– Divided into a High Availability zone (17 kW/rack) and a High Density zone (35 kW/rack)
– 24/7 NOC with operators on site
• Power
– 3.5 MW available, upgradable to 10 MW
– 1 MW 1+1 UPS power with diesel backup (HA zone)
– Output: 3-phase 240/415 V
• Cooling towers and chillers with an estimated PUE of 1.07 (defined below)
– Stage 1: radiative cooling during colder days
– Stage 2: evaporative cooling when the wet-bulb temperature is under 20 °C
– Stage 3: chillers as needed (should only happen a few days each year)
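For reference, PUE (power usage effectiveness) is the ratio of total facility energy to IT equipment energy, so an estimated PUE of 1.07 implies roughly 7% overhead for cooling and power distribution:

```latex
\[
\mathrm{PUE} \;=\; \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}},
\qquad
\mathrm{PUE} = 1.07 \;\Rightarrow\; \text{overhead} \approx 7\%\ \text{of the IT load}
\]
```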
5
6
Cedar Cluster Information – Stage 1
• Timeline
– GP2 Stage-1 RFP was issued by SFU in early June 2016, closed on July 26, 2016
– Proposals were reviewed by SFU and Compute Canada members
– Awarded to Scalar Decisions in September 2016
– Cluster was installed in Q1 2017 and passed acceptance tests on April 15, 2017
– SFU made the official announcement on April 20, 2017, and the cluster entered production on July 25, 2017
• Cluster nodes (902 nodes, 27,696 cores, 584 P100, 186TB RAM; core-count breakdown below)
– 576 base, Dell C6320, 2 E5-2683v4, 128GB RAM, 2 480GB SSDs
– 128 large, Dell C6320, 2 E5-2683v4, 256GB RAM, 2 480GB SSDs
– 24 bigmem512, Dell C6320, 2 E5-2683v4, 512GB RAM, 2 480GB SSDs
– 24 bigmem1500, Dell R630, 2 E5-2683v4, 1.5TB RAM, 2 480GB SSDs
– 4 bigmem3000, Dell R930, 4 E7-4809v4, 3TB RAM, 2 480GB SSDs
– 114 base GPU, Dell C4130, 2 E5-2650v4, 4 P100/12GB, 128GB, 800GB SSD
– 32 large GPU, Dell C4130, 2 E5-2650v4, 4 P100/16GB, 256GB, 800GB SSD
– Others: 8 head nodes, 5 DTN nodes, 8 management nodes, TDS partition
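As a sanity check (using the published core counts of these CPUs: 16-core E5-2683 v4, 8-core E7-4809 v4, 12-core E5-2650 v4), the node inventory reproduces the 27,696-core total:

```latex
\[
\underbrace{752 \times 32}_{\text{dual E5-2683v4 nodes}}
+ \underbrace{4 \times 32}_{\text{quad E7-4809v4 nodes}}
+ \underbrace{146 \times 24}_{\text{dual E5-2650v4 GPU nodes}}
= 24{,}064 + 128 + 3{,}504 = 27{,}696 \ \text{cores}
\]
```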
7
Cedar Cluster Information – Stage 1
• Interconnect
– Intel Omni-Path, 16 spine switches, 30 leaf switches
– Each leaf switch serves a 32-node island and has 16 uplinks, 1 to each spine
– 29+2 compute islands, 1+1 storage/service islands, 2:1 blocking factor
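The 2:1 figure follows directly from the leaf-switch port split (assuming all 32 node-facing ports are populated):

```latex
\[
\text{blocking factor} \;=\; \frac{\text{node-facing ports per leaf}}{\text{spine uplinks per leaf}}
\;=\; \frac{32}{16} \;=\; 2{:}1
\]
```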
• Global Storage & File Systems
– DDN SFA 14KX with 640 8TB drives, 4PB, Lustre filesystem
– 4 Embedded OSS servers running on SFA controllers, 64 OSTs
– 2 MDS servers with an EF4024 disk shelf, 6TB RAID 10, 2 MDTs
– 35GB/s read and 32GB/s write
• Rack Power and Cooling
– Each compute rack has one 3-phase 60 A PDU and a UHD RDHx, 35 kW power/cooling
– Storage/service racks have two 3-phase 30 A PDUs and an HD RDHx, 17.5 kW
– CPU and GPU islands are balanced to optimize space, power & cooling
8
Cedar Cluster Information – Stage 2
• Stage 2 expansion in Spring 2018
• Nodes
– 640 Skylake nodes, total 30,720 cores
– 122TB total memory
• Interconnect
– Intel Omni-Path, 16 spine switches, 20 leaf switches
– Adding 4 core switches to connect Stage 1 and Stage 2, 1:8 blocking
• Storage
– Expand SFA 14K by 200 8TB disks, all for scratch
9
Cedar Cluster Information – Stage 3
• Stage 3 expansion is funded by the ISED CI fund
• Installation planned in late 2019
• Nodes
– 768 Cascade Lake nodes, total 36,864 CPU cores
– 192 GPU nodes, 768 V100, 6144 CPU cores
– 184TB total memory
• Interconnect
– Intel Omni-Path, 16 spine switches, 30 leaf switches
– Adding 4 additional core switches to connect Stages 1 and 2, 1:4 blocking
10
Cedar Cluster Interconnect
11
Cedar Cluster Benchmark
• Stage 1 HPL benchmark was performed on GPU nodes only
– TOP500 ranking in June 2017: No.86, 1,337TF
– GREEN500 ranking in June 2017: No.13, 8 GFlops/Watt (implied power estimated below)
• Stage 2 HPL benchmark was performed on CPU nodes only
– TOP500 ranking in November 2018: No.190, 1,633TF
– Using hybrid HPL code from Intel to balance Skylake and Broadwell nodes
• Stage 3 HPL benchmark planned in Spring 2020
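Dividing the Stage 1 HPL result by the GREEN500 efficiency gives a rough implied power draw for the GPU partition during that run (an approximation only; the official measurement window differs):

```latex
\[
P \;\approx\; \frac{R_{\max}}{\text{efficiency}}
\;=\; \frac{1{,}337{,}000\ \text{GFLOPS}}{8\ \text{GFLOPS/W}}
\;\approx\; 167\ \text{kW}
\]
```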
12
Cedar Cluster Benchmark
13
Persistent Storage
• Long-term storage is funded by the National Data Cyberinfrastructure Fund
• Storage types include:
– Lustre filesystem (project filesystem), 20PB
– dCache storage, 10PB
– OpenStack Ceph, 3.5PB
– Offline/nearline tape storage, 60PB
• Direct access to this storage is available from Cedar
• Globus is the preferred option to move data between sites
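As a hedged illustration of that workflow, the sketch below uses the globus-sdk Python package to submit a transfer between two endpoints; the access token, endpoint UUIDs and paths are placeholders rather than Compute Canada's actual configuration:

```python
# Minimal sketch of a Globus transfer with the globus-sdk package.
# The token, endpoint UUIDs and paths are placeholders; obtaining a
# token (e.g. via an OAuth2 Native App flow) is omitted.
import globus_sdk

TRANSFER_TOKEN = "..."                     # placeholder access token
SRC_ENDPOINT = "source-endpoint-uuid"      # placeholder endpoint UUID
DST_ENDPOINT = "destination-endpoint-uuid" # placeholder endpoint UUID

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: copy one directory tree between the two endpoints.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="example project sync")
tdata.add_item("/project/mygroup/dataset/", "/project/mygroup/dataset/",
               recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted Globus transfer task:", task["task_id"])
```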
14
Lustre Filesystems – Home & Scratch
• Current /home and /scratch
– SFA 14K with 840 8TB disks, EXA 4.2
– ldiskfs backend, 8+2 RAID6
– 4 Embedded OSS servers, 4 OSTs for /home and 80 for /scratch
– Major performance issues occurred after the Stage 2 expansion
• Planning new /home
– move /home to a DDN SS9012 with 2 Dell R640 as OSS servers
– use ZFS backend OSTs: 12+2 RAIDZ2 with one SSD cache, total 6 OSTs
– move MDT from SAS to SSD based hardware RAID 10 storage, ldiskfs
• Planning /scratch changes
– Replace the embedded SFA 14K controllers with block-based controllers
– Add 4 Dell R640 as OSS servers
– Keep original MDS/MDT and OSTs
15
Lustre Filesystems – Project
Project filesystem is based on the CC storage building block, option 4
• Community version 2.10.7
• Each pair of OSS servers connect to 4 disk enclosures in JBOD mode
• Servers: Dell R630/640, 4 OSTs per server
• Enclosures: Seagate SP2584, SP3106 and SS9012
• ZFS backend, RAIDZ2 with SSD L2ARC cache
• OST level failover only, not using Multipathing
• 2 MDS servers using Dell R630, ldiskfs for MDT
• Directory structure is organized by projects and group quota is used
• Waiting to migrate to project quota
• Plan for 2.12 migration in 2020
16
Lustre Filesystems – Project
Performance Issues
• Initial MDT used 24 SAS drives, RAID 10
• Frequent high load on MDS and OSS servers, mostly caused by bioinformatics jobs such as BLAST
Resolution
• Replaced all SAS disks by SSDs in March 2019
• Significant improvement observed on both MDS and OSS loads
17
Lustre HSM Filesystem – Nearline
HSM Design
• The Lustre HSM filesystem uses a Robinhood + TSM copytool solution developed by CC members, mostly by Simon Guilbault
• IBM Spectrum Protect (aka TSM) is used for backup at all CC sites
• Cross sites replication is planned
• Use Robinhood as a policy engine, storing changelogs in MySQL
• lhsmtool_cmd calls a script to “archive” files to the Spectrum Protect server (a sketch of such a wrapper is shown below)
• Keep 2 copies of data on tape
Cedar implementation
• 1 DSS7000 with 8 OSTs (ldiskfs 9+2 RAID6) as OSS server
• Share the same MDS server with /project
• Directory structure is similar to /project, and the project GID is used as the project ID for project quotas
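The archive step can be illustrated with a hypothetical wrapper (a sketch only, not the actual CC copytool scripts), assuming the Spectrum Protect backup-archive client (dsmc) is available on the node running the copytool:

```python
# Hypothetical wrapper for archiving one file to IBM Spectrum Protect (TSM)
# with the standard "dsmc archive" client command. A production copytool
# wrapper would also handle restores, removals, retries and logging.
import subprocess
import sys

def archive(path: str) -> int:
    # Run the Spectrum Protect client; a non-zero return code tells the
    # copytool that the archive attempt failed.
    result = subprocess.run(["dsmc", "archive", path])
    return result.returncode

if __name__ == "__main__":
    sys.exit(archive(sys.argv[1]))
```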
18
Lustre Filesystems – Nearline
19
Auto Provisioning and Management (ADAM)
• Compute nodes are running CentOS 7.5
• OS is installed in memory (ramdisk)
• Local disks are used as local scratch only
• Using iPXE to boot an unconfigured node with temporary IP address
• Register node information and assign a permanent IP address (a registration sketch follows this list)
• Perform node firmware updates automatically during boot
• 2-stage installation process to boot 1600+ nodes within 60 minutes
• System provisioning using Puppet
• User authentication by Compute Canada LDAP services
• Global syslog collection and centralized monitoring
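The registration step can be illustrated with a minimal, hypothetical sketch (not the actual ADAM code): an iPXE script on a freshly booted node could chain to an HTTP endpoint like this one, which looks up the node's MAC address and returns its permanent IP:

```python
# Hypothetical node-registration endpoint (a sketch, not the real ADAM service).
from flask import Flask, request, jsonify

app = Flask(__name__)

# Toy inventory mapping MAC addresses to permanent IPs; a real system would
# use a database populated during hardware registration.
INVENTORY = {"aa:bb:cc:dd:ee:01": "10.10.1.101"}

@app.route("/register")
def register():
    mac = request.args.get("mac", "").lower()
    ip = INVENTORY.get(mac)
    if ip is None:
        return "unknown node", 404
    # The booting node (or its second-stage installer) applies this address
    # and continues with provisioning.
    return jsonify({"mac": mac, "permanent_ip": ip})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```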
20
Software Application Delivery
• CVMFS is used to provide nation-wide software distribution
• Software in CVMFS is maintained by CC analysts from all regions
• Nix/EasyBuild are used to install software to the CVMFS Stratum 0 server
• CVMFS Stratum 1 servers are available in the East and the West
• Local sites have dedicated Squid Servers
• CC provided training for analysts to use Nix, EasyBuild and Lmod to build and install software in CVMFS
• Local (Cedar-only) software is installed on an NFS server
• Singularity Container support
• Module environment to run applications
• Job scheduler: Slurm
21
Resource Allocation & Scheduler
• All CC resources are free to Canadian researchers in public institutions
• Projects have a default allocation (core-years of CPU, TB of storage)
• CC issues a resource allocation call every year, including:
– Resources for Research Groups (RRG), award for 1 year
– Research Platforms and Portals (RPP), award for 3 years
– Rapid Access Services (RAS), short term, no application necessary
• Allocations: CPU (core-years), GPU (GPU-years) & storage (TB); see the core-year arithmetic below
• Applications are reviewed by science and technical committees
• Allocation data is integrated into LDAP and pulled into Slurm DB
• Job walltime: 12 hours to 28 days
• Multiple partitions: by-core, by-node, by-gpu, by-gpu_node
• Cedar is suitable for serial & small-to-medium-sized parallel jobs
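For reference, the core-year unit used in allocations works out to the following, so an award of N core-years corresponds to an average of N cores in continuous use over the allocation year:

```latex
\[
1\ \text{core-year} \;=\; 1\ \text{core} \times 365\ \text{days} \times 24\ \text{h/day}
\;=\; 8{,}760\ \text{core-hours}
\]
```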
22
Compute Canada Support Model
• The Cedar systems team maintains hardware, OS, interconnect, storage
• CC RSNT maintains the software stack in CVMFS and interfaces with users
• CC support staff will help any user to use these resources
• OTRS support ticketing system
• National help desk
• Other national teams provide various support and consulting
• With consent, CC analysts can use “ccsudo” to access users’ home directories and debug their problems; “ccsudo” writes audit logs
23
SFU Data Centre
24
SFU Data Centre
25
Cooling Towers
26
Mechanical Room
27
Rack Power and Cooling
28
Cedar Compute racks
29
Acknowledgement
30
B.C. Knowledge Development Fund
Questions?
31