
Page 1

Scientific Computing
Desktops to Exascale

June 13, 2018

Chip Watson
Scientific Computing

Page 2

Quick Outline

Overview of...
– Computing Trends
– Computing Resources at the lab
– Storage Resources at the lab
– Distributed Computing Plans
– Exascale Computing

Warning: watching PowerPoint presentations could be hazardous to your health

death by presentation – hikingartist.com

Page 3

Ins and Outs for this Talk

1.  Enjoy! I hope that you will find something new and interesting in this talk, and maybe learn a bit about scientific computing.

2.  Ask Questions! I’m happy to make this as interactive as you like, and won’t mind if you interrupt me or ask me something I don’t know.

Page 4

From Physics to Computing: Chip’s History

– B.S. in Physics, Georgia Tech, 1975
– Ph.D. Experimental Nuclear Physics, Duke, 1980 (also learned to program)
– Short postdoc at Duke, then at State University of New York Stony Brook
– Joined staff at Brookhaven National Laboratory in 1985 to work on a data acquisition system

Page 5

Computing in DAQ, Accelerator, HPC
– Joined CEBAF in 1988 to start the Data Acquisition Group
– In 1993 moved to lead the Accelerator Controls Group (team of 24) to upgrade the controls system while the accelerator was being commissioned
– Sabbatical at CERN 1997-1998 (LHC-related accelerator controls)
– Returned & started the High Performance Computing Group
– When the IT Division was formed in 2006, became the head of Scientific Computing as well as Deputy CIO

CEBAF Accelerator

Page 6

Top 500 List, Performance by Year

~1000x every 6 years (except recently)

A commodity server matches the #1 machine from 10 years ago (10 more years to get it into your pocket)

[Chart: Top 500 performance by year, with curves for the Sum, #1, and #500 systems, plus JLab; today's #1 becomes #500 in 6 years]

Page 7

Top 500 Computers
Single-processor high performance machines disappeared in 1997. High Performance Computing clusters now dominate the list.

Page 8

Recent Top 500 Trend

Page 9

Accelerators & Advanced Chips

Page 10

Jefferson Lab Mid-Range HPC

Scientific Computing Resources summary:
– Xeon Phi (KNL) Cluster (rightmost 7 racks, 440 nodes, 2016)
– GPU Cluster (45 nodes in 3 racks, back side, 2012)
– 2 conventional x86 clusters (9 racks total, 450 nodes)
– IBM Tape Library (left, in the distance)
– Disk servers, interactive and admin nodes (not shown)

In the bulk of our computing resources, JLab uses the same highest-performance chips found in the Top 500 machines.

Page 11

“SciPhi XVI” (2016): Top500 #397, Green500 #10
– 4 racks of 64 nodes, 264 nodes total
– 50 TB memory in 256 nodes (largest possible single job)
– Per node: Xeon Phi 7230, 64 cores, 192 GB memory plus 16 GB high-bandwidth memory, 1 TB disk (O/S plus scratch)
– 100 Gbps Intel OmniPath fabric

Integration:
– Extra 8 nodes (4 for IT division)
– 4 LNET routers (OmniPath to QDR Infiniband)
– 2 IP routers (OmniPath to QDR Infiniband)

OmniPath fabric (48-port switches):
– 8 leaf switches: 32 nodes, 16 uplinks each
– 3 core switches (with the extra nodes and routers)
– Mix of copper & fiber cabling

Page 12

Just upgraded to 444 nodes last week!

Page 13

Conventional Clusters

For Experimental Nuclear Physics
• 200 nodes, vintage 2012 to 2016 (dual 8-core to dual 18-core)
• Total: 4.5k scaled cores, with 1.7 to 3.0 GB memory / core
• Heavy re-use of the HPC/LQCD fabric, so everything is on 40 gigabit Infiniband
• Soon to add: 88 nodes, dual 20-core Skylake Intel Xeon, another 3.5k cores

For Lattice QCD
• 247 nodes, 2012 ‘Sandy Bridge’ Intel Xeons
• Dual 8-core, 32 GB memory, QDR Infiniband (40g)
• Soon to be a shared resource for ENP and LQCD

Total resource, Fall of 2018: > 10k cores

Page 14

Storage (1): Lustre File System (version 2.5)
• Distributed file system (15 servers)
• 2 (redundant) metadata servers (directory info)
• 13 Object Storage Servers, 2.2 PB formatted
  - ZFS local file system
  - RAID-Z2 8+2 raid stripes
  - Full RAID check on read

Two name spaces, /cache and /volatile, are both managed so that they are never full (in-house software). /cache is a write-through cache to tape, with a time delay before the write to tape so users can delete a file before it is committed; the oldest files are deleted to keep usage below 80% full, with ZFS-style quotas. /volatile is similar, but skips the write to tape and deletes the oldest files per group.
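As a rough illustration of the "delete the oldest files to stay below 80% full" policy described above, here is a minimal Python sketch. It is not the lab's in-house software; the mount point, threshold, and function names are assumptions for illustration only.

```python
# Minimal sketch (NOT JLab's actual in-house software) of a "delete oldest
# files until usage drops below the high-water mark" cache policy.
import os
import shutil

CACHE_ROOT = "/cache"   # hypothetical mount point
HIGH_WATER = 0.80       # start deleting above 80% usage

def usage_fraction(path):
    """Fraction of the file system holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total

def oldest_files_first(path):
    """All regular files under `path`, oldest (by modification time) first."""
    files = []
    for dirpath, _dirs, names in os.walk(path):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                files.append((os.path.getmtime(full), full))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return [f for _, f in sorted(files)]

def enforce_high_water(path=CACHE_ROOT, high_water=HIGH_WATER):
    """Delete the oldest cached copies until usage drops below the threshold.
    Assumes every file here already has a safe copy on tape (the /cache case)."""
    candidates = oldest_files_first(path)
    while usage_fraction(path) > high_water and candidates:
        os.remove(candidates.pop(0))

if __name__ == "__main__":
    enforce_high_water()
```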

Page 15

Desktops
Yes, we still use desktops to do scientific computing!

Desktops give you...
– Interactive response, on-demand access
– Graphical interface, visualization
– Access to all the data! (Yes, you can mount a 2 PB file system on your desktop!)

The trade-offs:
– Relatively low performance (2-8 cores, vs. 16-68 cores)
– Relatively low file bandwidth (a meager 1 Gbps)

Page 16

Storage (2): ZFS / NFS File Server
Dual-head server, 2 JBOD disk arrays, 0.4 PB
• 1 head for LQCD, 1 head for Experimental Physics
• ZFS local file system
• RAID-Z2 5+2 raid stripes (optimized for small files)
• Full RAID check on read (Lustre also uses this)

Tape Library & Servers
• IBM 3584 tape library with 4 LTO-5, 8 LTO-6, 4 LTO-7, 4 LTO-8 drives
• Servers control 1 or 2 drives, achieve full write speed
• Auto-create tar-balls on tape to reduce tape marks
• MD5 checksum stored in a database on write, checked on read
• Rich set of utilities for file duplication / migration, with % read-back test on a different drive
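The MD5-on-write / check-on-read idea above can be illustrated with a short Python sketch; this is illustrative only, not the lab's actual tape software, and the in-memory dictionary stands in for the real database.

```python
# Illustrative sketch of "MD5 checksum stored on write, checked on read".
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Stream the file in 1 MB chunks and return its MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

checksums = {}  # stand-in for the real checksum database

def record_on_write(path):
    """Compute and store the checksum when the file is written to tape."""
    checksums[path] = md5_of_file(path)

def verify_on_read(path):
    """Recompute the checksum on restore and compare with the stored value."""
    return md5_of_file(path) == checksums.get(path)
```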

Page 17

Integration Software

Scientific Computing has two staff working on integration software:
• Web apps to view system status (right & next page): http://lqcd.jlab.org/ and http://scicomp.jlab.org/
• Custom tape library software
• Custom disk management and file migration software (poor man’s HSM)
• (Soon) Remote compute integration / bursting to offsite center / cloud

Page 18

Scientific Computing Web App

Page 19

Extensive monitoring software

Page 20

Gulping vs. sipping

When we see dark blue, we suspect the application is doing I/O poorly: reading in small chunks instead of big chunks. Our servers are configured for streaming large files, not for handling thousands of small I/O requests. A good size is 1 MB per read or write. Big brother is watching! Gulp, don't sip when running 100+ jobs!!!
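To make the gulp-vs-sip point concrete, here is a hedged Python sketch: the same amount of data read with 1 MB "gulps" versus tiny "sips". The chunk sizes and the file are illustrative assumptions, not settings prescribed by the monitoring system.

```python
# Same data volume, very different load on the file server.
CHUNK_GULP = 1024 * 1024   # 1 MB reads: what the servers are tuned for
CHUNK_SIP = 512            # tiny reads: millions of requests per large file

def read_all(path, chunk_size):
    """Read an entire file in fixed-size chunks.
    Returns (bytes_read, number_of_io_requests)."""
    total, requests = 0, 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            requests += 1
    return total, requests

# For a 1 GB file, gulping issues about 1,000 read requests while sipping
# issues about 2,000,000. Multiply by 100+ concurrent jobs and the
# difference is what shows up as dark blue in the monitoring plots.
```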

Page 21

Integrating Offsite Computing

Drivers:
• Experimental physics’ peaks and valleys of demand are becoming larger: sharing with LQCD is no longer a large enough flywheel to smooth out load variations, especially since HPC nodes are not always a good match
• Provisioning to peaks is expensive (idle time wastes money)

Options:
• Send jobs to OSG, Supercomputer Centers, NERSC, Cloud, …

Considerations:
• Procurement costs, learning curve for users & for operations, integration costs, etc.
• Wide Area Networking bandwidth constraints (today) will shrink by 2020 when ESnet upgrades us to 100g links

Page 22

Recent dips and peaks

When beam is on, load is high (late Fall, this Winter). Going forward, demand peaks will increase 4x, resources 2x.

Page 23

Viable Offsite Resources

1. NERSC (National Energy Research Scientific Computing Center)
   - Annual allocations from DOE / OASCR (free to us)
   - Singularity to push the JLab environment onto supercomputer nodes (see the sketch after this list)

2. Open Science Grid (OSG)
   - Collection of mostly university clusters, to which the GlueX collaboration has access (i.e. free)
   - CERN-motivated “grid” software enables distributed job submission and execution as well as file migration
   - Singularity containers to push the JLab environment to remote nodes

3. University clusters (usage similar to OSG)

4. Cloud
   - Infrastructure as a service, using a VPN to pull nodes from the cloud provider into our batch system cluster
   - Easy to integrate (in principle), at an estimated 2x cost per node-hour compared to in-house
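As a minimal sketch of the Singularity approach mentioned above, the snippet below wraps a `singularity exec` call from Python. The image name, bind path, and job command are hypothetical placeholders, not the actual GlueX or JLab production setup.

```python
# Launch a job inside a Singularity container so a remote node sees the
# same environment as a JLab farm node. All names below are illustrative.
import subprocess

IMAGE = "jlab_env.sif"     # hypothetical container image with the JLab software stack
BIND = "/cache:/cache"     # hypothetical bind mount making shared storage visible inside

def run_in_container(command):
    """Run `command` (a list of strings) inside the container image."""
    return subprocess.run(
        ["singularity", "exec", "--bind", BIND, IMAGE] + command,
        check=True,
    )

if __name__ == "__main__":
    run_in_container(["my_analysis", "--run", "12345"])  # hypothetical job command
```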

Page 24

Exascale Computing Project (ECP)

Jefferson Lab’s Role in Exascale Science: Hardware and Software Developments

exa: a thousand peta, or a million tera

For our truly largest calculations, LQCD theorists use DOE’s Leadership Computing Facilities (LCF).
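As a quick unit check (standard SI prefixes; the numbers are not from the slide itself):

$$1\ \mathrm{exaflop/s} = 10^{18}\ \mathrm{flop/s} = 10^{3}\ \mathrm{petaflop/s} = 10^{6}\ \mathrm{teraflop/s}$$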

Page 25

Exascale encompasses 17 science applications, including Nuclear Physics

Science Applications in ECP

QCD: Quantum Chromodynamics, the theory that describes the interactions of the quarks and gluons that constitute the matter of the visible universe; it is part of the Standard Model (SM).

The Exascale Computing Project is funding several people at Jefferson Lab to advance the state of the art in our LQCD calculations.

Page 26

Exascale Hardware - the Path Forward

Exascale Computing, 11/1/2017

Lattice QCD is an exemplar application:
• LQCD consumes about 10% of all supercomputer resources in the US
• LQCD supplies acceptance and benchmarking codes for supercomputer centers
• LQCD provides a training engine for students/postdocs who move to industry
• LQCD has a long history of community involvement with software & hardware design
• LQCD kernels have influenced the architectural design of chips from IBM, Intel and NVIDIA, and we work with these silicon manufacturers to help influence their designs for mutual benefit

Page 27

Summary

We have a significant set of mid-range HPC systems, including computational, storage, and networking resources. These systems are upgraded each year to keep pace with the scientific demands (within finite budgets, of course).

Jefferson Lab is engaged in major software developments, from theory simulations to data analysis, pushing the state of the art as needed to get the science done.

Please use these systems (and our support staff) to do great science!

Exascale systems are due in 3 years. So by the time you are my age, you should have that much power in your wearable tech.