WG Town Hall Lindsay Sill, Executive Director Martin Siegert, SFU (GP2/Cedar) Site Lead Sergiy Stepanenko, University of Saskatchewan Site Lead Erin Trifunov, Manager Projects & Outreach Friday, January 27, 2017

WG Town Hall - WestGrid Town Hall January 27 2017_0.pdf · (GP3, Waterloo) Shipping planned for 1-st week of February. Renovations almost complete (end of January) April 2017 Network

WG Town Hall

Lindsay Sill, Executive Director
Martin Siegert, SFU (GP2/Cedar) Site Lead
Sergiy Stepanenko, University of Saskatchewan Site Lead
Erin Trifunov, Manager Projects & Outreach

Friday, January 27, 2017

Introduction & Outline

1. New systems updates
2. Bugaboo storage system issues
3. Cedar / GP2 System
4. Migration Updates
5. RAC / Growing Needs
6. Upcoming Training Opportunities

Admin

Questions:
● email “[email protected]”, OR
● use Vidyo chat (for those on Vidyo)

Please MUTE yourself if you’re connected via Vidyo and not speaking

New System Updates

Lindsay Sill
Executive Director
WestGrid

Technology Deployment Overview

● Major deployment of new resources underway:
○ National Data Cyberinfrastructure
○ New Cloud resources
○ New HPC resources
○ New Services

● Technology Briefing published by Compute Canada in November.

● Cloud Strategy & Services document updated in December.

Stage 1 Award to Implementation

● June 2015: Award notification
● February 2016: Award finalization
● September 2016: 1st system (Arbutus) operational
● April 2017: Cedar and Graham operational
● End of 2017: Niagara operational

Target: all four major new systems in full production less than 2 years after award finalization. Software services development continues through 2018.

(Note: Niagara schedule purposely delayed on the recommendation of the CFI expert panel, to benefit from technology improvements)

National Systems Update: Compute

System | Status | In-production Estimate
Arbutus (GP1, UVic) | West.cloud.computecanada.ca: 7,640 cores | DONE (Sep. 2016)
Cedar (GP2, SFU) | Equipment is currently being delivered | April 2017
Graham (GP3, Waterloo) | Shipping planned for first week of February; renovations almost complete (end of January) | April 2017
Network | Preferred vendor testing | 2017
Parallel FS | In progress; will be ready for PROJECT and SCRATCH | February 2017
Scheduler | Open-source Slurm with commercial support; small test/dev cluster in cloud | February 2017
Niagara (LP1, Toronto) | Early discussion; RFP expected to go out in February | Late 2017

National Systems Update: Storage

System | Status | In-production Estimate
Silo Interim | Waterloo: migration complete; SFU: migration underway | Available (see following slides)
NDC-SFU | POs waiting for signatures; vendor is ready to go | April 2017
NDC-Waterloo | 13 PB of SBBs delivered; waiting for datacentre completion | Available: April 2017 (aim: March 2017)
NDC - Object Storage | Object storage, DDN WOS; initial prototype for internal testing installed on cloud | Mid 2017 (aim: April 2017)
Attached (Scratch) | High performance storage attached to clusters | Purchased with the clusters

NDC = “National Data Cyberinfrastructure”

Cedar / GP2 System Details

Martin Siegert
WestGrid & National Site Lead
Simon Fraser University

Bugaboo Storage Issues

Bugaboo has been having storage system issues since Christmas.
● The old DDN 10K system
● Series of disk, controller and corresponding software problems.

Currently down.
● Working with DDN.
● File system corruption: about 6000 files affected in /global/scratch; estimated time to fix: one month
● Options:
○ Continue fixing the problems, Bugaboo unavailable for one month
○ Stop the fix, bring Bugaboo back up, lose 3000+ files

Decision has been made to proceed with the latter option.

Coming Soon - Cedar and Graham

These will be the most powerful CC systems ever, with multiple node types to meet a variety of needs.

- Compute nodes, with local storage
- NVIDIA “Pascal” GPU nodes
- Bigmem nodes

Delivery, installation and configuration will take place from late January through March.

The DDN /scratch storage for Cedar (GP2) has been installed into racks at the SFU Data Centre.

Cedar Specs

Node Type | # Nodes | Cores/Socket | # Sockets | Mem | Details
Base compute | 576 | 16 | 2 | 128 GB | E5-2683 V4 2.1 GHz
Large compute | 128 | 16 | 2 | 256 GB | E5-2683 V4 2.1 GHz
Bigmem500 | 24 | 16 | 2 | 0.5 TB | E5-2683 V4 2.1 GHz
Bigmem1500 | 24 | 16 | 2 | 1.5 TB | E5-2683 V4 2.1 GHz
Bigmem3000 | 4 | 8 | 4 | 3 TB | E7-4809 V4 2.1 GHz
GPU nodes | 146 | 12 | 2 | 128 GB, 256 GB | E5-2650 V4 2.2 GHz, 4 x NVIDIA P100 GPUs

Interconnect: Intel OmniPath (version 1) (100 Gbit/s), non-blocking within “islands”, 2:1 blocking between islands
Storage: (next slide)
Vendor: Scalar Decisions (Dell, DDN, Intel)
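
As a rough cross-check of the table above, the following sketch adds up the cores and GPUs implied by the slide's per-node figures. The variable and function names are illustrative only, and the totals are simple arithmetic on the numbers shown here, not an official system count.

```python
# Per-node figures copied from the Cedar Specs slide.
# Tuple layout (an illustrative choice, not from the slide):
# (node type, number of nodes, cores per socket, sockets, GPUs per node)
CEDAR_NODE_TYPES = [
    ("Base compute", 576, 16, 2, 0),
    ("Large compute", 128, 16, 2, 0),
    ("Bigmem500", 24, 16, 2, 0),
    ("Bigmem1500", 24, 16, 2, 0),
    ("Bigmem3000", 4, 8, 4, 0),
    ("GPU nodes", 146, 12, 2, 4),
]


def totals(node_types):
    """Return (total CPU cores, total GPUs) implied by the table rows."""
    cores = sum(nodes * cps * sockets for _, nodes, cps, sockets, _ in node_types)
    gpus = sum(nodes * gpus_per_node for _, nodes, _, _, gpus_per_node in node_types)
    return cores, gpus


total_cores, total_gpus = totals(CEDAR_NODE_TYPES)
print(f"CPU cores: {total_cores}, GPUs: {total_gpus}")
```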

Cedar Storage

Type | Estimated Size (PB) | Node Access | Allocated? (RAC) | Quota | Purged | Details
Home | - | Mounted | NO | Yes | No | 50 GB/user; code, configuration files
Scratch | - | Mounted | NO | Yes | Yes | High performance, Lustre; default: 20 TB, 1M files per user; 100 TB, 10M files per group
Project | 10-20 | Mounted | YES | Yes | No | Very large, low performance (external), tape backup
Nearline (tape) | - | None | YES | Yes | Maybe | Tape only

Project tape backup will be auto-replicated between SFU and Waterloo tape systems.

Graham (Waterloo)

● Very similar to Cedar.
● Essentially identical software stack and batch system.
● So users can move between the two very easily.
○ However, RAC allocates to a particular system
● Physical mix
○ slightly different mix of small-large mem nodes and GPU nodes.
● InfiniBand interconnect (50 Gbit/s)
○ Non-blocking within islands, 8:1 blocking between islands
● Details on CC docs: https://docs.computecanada.ca/wiki/Graham

Migration Planning

Erin Trifunov
Manager Projects & Outreach
WestGrid

Migration Stats

➔ 19 of 45 systems across Canada will be defunded* after March 31, 2017 and will be unavailable for allocations.

➔ 875 projects (40%) have utilized the to-be-defunded systems

Thousands of users will be migrating onto the new systems
○ Move datasets
○ Get code working (recompile, etc.)
○ Set up jobs

Goal: For all users to have superior support for their migration
○ Well-documented & functional systems
○ Outstanding support: Local, regional, national

Who needs to migrate?

Users of systems scheduled to be “defunded” after March 31, 2017 (next slide)

Anyone else wishing to use new national systems is also welcome.

WestGrid Legacy Systems

Site | System(s) | Defunding date* | Current Status
Edmonton | hungabee/jasper | Mar 31, 2017 | Available with conditions
Victoria | hermes/nestor | Mar 31, 2017 | Hermes virtual
Calgary | breezy/lattice/parallel | Mar 31, 2017 | Parallel extended to Mar. 31, 2018
Vancouver - UBC | orcinus | Mar 31, 2018 | Available but with conditions
Winnipeg | grex | Mar 31, 2018 | New storage coming
Vancouver - SFU | bugaboo | Mar 31, 2018 | Storage support issues

*Please note these are provisional dates.

WestGrid Migration details: https://www.westgrid.ca/migration_process
For other regional systems see https://www.computecanada.ca/research-portal/accessing-resources/migration/

Next Steps...

Users of legacy systems to be defunded:

1. Wait for WestGrid support to contact you.

2. Prepare files for migration:
a. Clean up files & directories (DELETE unneeded files!)
b. Archive & compress files/directories
c. Transfer files
d. Verify and synchronize files

https://docs.computecanada.ca/wiki/General_directives_for_migration
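
One way to approach step (d), verifying transferred files, is to compare checksums of the source and destination trees. The sketch below is a minimal illustration using only the Python standard library; the function names are hypothetical and it is not the procedure prescribed by the migration documentation linked above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(source_manifest, dest_manifest):
    """Return (files missing at the destination, files whose checksums differ)."""
    missing = sorted(set(source_manifest) - set(dest_manifest))
    changed = sorted(
        name
        for name in source_manifest
        if name in dest_manifest and source_manifest[name] != dest_manifest[name]
    )
    return missing, changed
```

In practice the two manifests would be built on different systems, so one of them would need to be saved to a file and copied across before calling `compare`.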

Migration Schedule (February to April)

● Regions contact users required to migrate with timeline and further instructions
● Mar. 1: RAC 2017 award letters sent
● Mar. 15: Virtual Test System available
● Apr. 13: RAC 2017 implemented
● Migration to new systems

Schedule mitigations

Virtual TDS: Being configured now, on UVic’s Arbutus cloud
➔ Suitable for internal configuration and testing
➔ Not suitable for general user support, migration, etc.

User migration system
➔ We hope to have a system available to users by March 15, 2017.
➔ Perhaps earlier, or perhaps with staged availability for support personnel & selected users
➔ This might be the vTDS (but expanded to support “toy”-sized parallel jobs)
➔ This might be one or both of Cedar or Graham

WARNING: April may have very limited resources due to delivery delays.

Migration & New System Info

Compute Canada document wiki now available:
https://docs.computecanada.ca

In particular the Migration pages:
https://docs.computecanada.ca/wiki/Migration2016

And of course WestGrid:
https://www.westgrid.ca

Silo Migration

Sergiy Stepanenko
WestGrid Site Lead
University of Saskatchewan

Interim Silo Storage I

● Transfer to Waterloo COMPLETE. SFU almost done.
● Silo interim storage “Storage Building Block” (SBB):
○ Waterloo: relatively simple, low-performance NFS system.
○ SFU: shared Gluster filesystem.
● User logins are similar to Silo logins.
○ But using National LDAP accounts. For details see: https://docs.computecanada.ca/wiki/Migration2016:User_Accounts_and_Groups
● Backed up to the new tape systems.
○ Backup currently in progress

Silo Migration Stats

Silo to Waterloo completed Jan. 11, 2017:
● 57M files, 850 TB, 140 users.
● Note: a few very large users are still transferring data from their own master copy (usually experimental or observational data backups on Silo)

Silo to SFU, as of Jan. 24, 2017:

Start date for Silo to SFU | Jan. 11, 2017
Files transferred to SFU | 21,503,022 of ~350M
Data transferred to SFU (current) | 198.73 TB of ~850 TB
Users migrated to SFU | 160 of ~300
Max transfer rate | 650 MB/s
Current transfer rate | 85 MB/s (some issues with controllers and FC network under investigation)
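
The transfer figures above give a rough sense of how long the remaining Silo to SFU copy could take. The sketch below is a back-of-the-envelope estimate only: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant transfer rate, neither of which is stated on the slide.

```python
TB = 10**12  # bytes per terabyte (decimal units: an assumption)
MB = 10**6   # bytes per megabyte (decimal units: an assumption)
SECONDS_PER_DAY = 86_400


def days_remaining(total_tb, done_tb, rate_mb_per_s):
    """Days needed to move the remaining data at a constant rate."""
    remaining_bytes = (total_tb - done_tb) * TB
    seconds = remaining_bytes / (rate_mb_per_s * MB)
    return seconds / SECONDS_PER_DAY


# Figures from the slide: ~850 TB total, 198.73 TB done as of Jan. 24, 2017.
at_current_rate = days_remaining(850, 198.73, 85)   # current rate: 85 MB/s
at_max_rate = days_remaining(850, 198.73, 650)      # max observed rate: 650 MB/s
print(f"At 85 MB/s: about {at_current_rate:.0f} days")
print(f"At 650 MB/s: about {at_max_rate:.0f} days")
```

Under these assumptions the remaining copy takes roughly three months at the current rate and under two weeks at the maximum observed rate, which shows how much the estimate depends on resolving the controller and FC network issues.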

Interim Storage Solution II

● This is an interim solution for Silo data.

● A second migration to the final storage system may be required - likely summer 2017.

● USask has agreed to keep Silo going until the migration has completed. Many thanks to USask and the other regions!

Questions?

Questions for Sergiy?

Growing Needs
Future Needs, RAC, Training & Tuques!

Lindsay Sill
Executive Director
WestGrid

International Comparisons

● Comparisons of giga-FLOPS (GF) per researcher
○ Canada used to be #6 (2009)
○ We are now #24 (2015)
● Comparator countries for GF/researcher in charts that follow:
○ US - #3 in 2015
○ Germany - #5 in 2015
○ Czech Republic - #10 in 2015

International Rankings - Log Scale
GF = Gigaflop/s

Continued Growth in User Base

Resource Allocation

Process | Schedule
2016 RPP Progress Report (from PIs) | Due January 5, 2017 (DONE)
CC Scientific and Technical Reviews | December/January (COMPLETE)
CC Face-to-Face Review meeting | February 7/8
Allocation letters to users | Early March
Implementation | April

Growth in Number of Requests

Resource Allocation - 2017

Resource | 2017 Requests | 2016 Requests | % Change
Compute - CPU-years | 256,000 | 238,000 | +7.5%
Compute - GPU-years | 2,660 | 1,357 | +96%
Storage (TBs) | 55,000 | 28,660 | +92%

Resource | 2017 Requested Fraction Available | 2016 Requested Fraction Available
Compute - CPU | 54%* | 54%
Compute - GPU | 38% | 20%
Storage | 90+% | 90+%

* 54% in 2017 includes 50k+ new cores with better performance
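
The "% Change" column can be reproduced from the request totals with a one-line formula. The sketch below recomputes it from the slide's figures; the dictionary and function names are illustrative. The CPU figure comes out at about 7.6%, which the slide shows as +7.5%.

```python
# (2017 requests, 2016 requests), copied from the Resource Allocation slide.
REQUESTS = {
    "Compute - CPU-years": (256_000, 238_000),
    "Compute - GPU-years": (2_660, 1_357),
    "Storage (TBs)": (55_000, 28_660),
}


def percent_change(new, old):
    """Year-over-year change as a percentage of the earlier value."""
    return (new - old) / old * 100


changes = {name: percent_change(new, old) for name, (new, old) in REQUESTS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```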

RAC Summary

● 2016 saw increased demand, static supply.
● 2017 changes:
○ new systems (Arbutus, Cedar, Graham)
● Replacing older (fragile) systems with larger, more robust systems will help, i.e. codes should run much faster.
● Massive user migration coincides with new system commissioning and 2017 RAC allocation implementation.
● Demand has continued to grow. 2017 will also be tough.

RAC 2017 success rate will be very similar to 2016.

Training Sessions

Full details online at www.westgrid.ca/training

DATE | TOPIC | TARGET AUDIENCE
JAN 31 | Intro to OpenMP: Part 1 | Anyone
FEB 7 | Tools for Managing Research Data: Intro to REDCap | Anyone
FEB 8 | Improving Your Visual Science Communications: Plots & Figures | Anyone
FEB 9 | Building Research Platforms and Portals in the Humanities & Social Sciences | Humanities & Social Sciences
FEB 21 | Intro to OpenMP: Part 2 | Anyone
FEB 28 | Visualization Workshop @ University of Alberta | Anyone

CC Staff Awards of Excellence

Nominations Open January 20, 2017!

Nominate a team or team member and share with your community on campus

Any active full-time or part-time Compute Canada, ACENET, Calcul Québec, Compute Ontario or WestGrid team member is eligible for nomination.

Submissions due April 21, 2017

www.computecanada.ca

#Tuques4Compute

National Social Media Campaign to be kicked off by Dr. Art McDonald in late January.

GOAL: Linking world-class research in Canada with ARC

Suggestions and participation welcome.

Questions?

Webstream viewers: email your Town Hall questions to [email protected]