Other Lab and Facilities R&D Projects - Fermilab
Elizabeth Sexton-Kennedy and James Amundson
A Coordinated Ecosystem for HL-LHC Computing R&D
October 23, 2019


When We Were Last Here

• FNAL Topics
  – Geant Collaborators
  – dCache Collaborators
  – DOMA
  – OSG Collaborators
  – 2 HSF Framework Conveners
  – Generator SciDAC
  – Provided for IF and expertise for CMS


Software R&D Strategy

• Evolve the Fermilab facility for future experiment needs using modern computing hardware
  – CMS
  – DUNE
  – Mu2e
  – Other Intensity Frontier experiments
  – Cosmic Frontier
• Support the use of external computing, including both HTC and HPC resources, especially Exascale
• Assist the experiments in taking advantage of advances in computing hardware and software
• Topics
  – Artificial Intelligence
  – Evolving Computing Architectures
  – Compute (aka HEPCloud)
  – Storage
  – Networking
  – Analysis Facility Concepts
  – Quantum Computing
    • in support of Lab program
    • not covered here


Artificial Intelligence

• Major strategic topic for the lab and SCD
  – Fulfills many purposes
    • Investigate better approaches to computing problems
    • Get compatibility with new computing hardware “for free”
    • Help scientists pursue research opportunities in a vibrant field
    • Pursue emerging funding opportunities
• Nhan Tran moved to SCD to head the AI-centered group within the Artificial Intelligence and Software for Physics Applications (AISP) department
• New Computational Science Seminar series to explore collaboration opportunities with Fermilab, Argonne, and U of Chicago
  – First topic is AI. First talk is in November by Nhan


AI Research Efforts at Fermilab

• Many efforts
  – DOE Early Career Award - Deep learning for boosted Higgs program
    • Nhan Tran
  – CompHEP: High Velocity AI (Fast inference, Distributed training, Uncertainty Quantification) [with ORNL and ANL]
    • 2.2 FTE including 1 postdoc + 1 intern
  – LDRD: Accelerator Control with AI
    • 0.6 FTE, including 50% postdoc
      – 0.06 FTE (25K) each from ANL and ORNL (plan is to grow this to 0.15-0.2 FTE/site). 0.3 FTE of an engineer and scientist from PNNL.
  – LDRD: Graph Neural Networks for Calorimetry and Event Reconstruction
    • 0.6 FTE, postdoc shared with Exa.Trx
  – LDRD: Modeling Physical Systems with Deep Learning Algorithms
    • 0.6 FTE, cosmic


Evolving Computing Architectures

• GPUs are the near-term big winners in “new” computing architectures
  – GPUs for HTC
  – GPUs for HPC
    • OLCF: Summit
    • NERSC: Perlmutter
    • Exascale
      – ALCF: Aurora
      – OLCF: Frontier
• Specialized AI hardware is on the horizon
• Vectorization does not look as important as it did a few years ago


Evolving Computing Efforts

– Exa.Trx: Algorithms for tracking/classification on HPC and TDAQ (FNAL, LBNL, SLAC, Caltech)
  • 1.2 FTE (two shared postdocs)
    – 1 FTE researcher from Caltech; 1 FTE postdoc from U of Cincinnati
    – Other labs
– SciDAC: HEP Reco on Parallel Architectures
  • CMS Tracking and LAr Reconstruction
  • Working with ORNL experts on adapting algorithms to GPUs
  • 1.2 FTE, including two 50% postdocs
    – UOregon 0.1 FTE professor; 0.5 FTE student [CMS contributes too, but they have separate funding]
– SciDAC: HEP Data Analytics on HPC
  • Optimization and fitting frameworks, tuning for generators, exploring HDF5 and object stores, data parallel programming with Python and HPC C++ tools
  • 1.8 FTE
    – University of Cincinnati - 1 FTE visiting scholar, 0.5 FTE postdoc; Colorado State University - 0.5 FTE postdoc, 0.5 FTE grad student
    – Other labs
– SciDAC: Accelerator Simulations (ComPASS 4)
  • Accelerator simulation
  • Leveraging native HPC experience to share with other HEP
    – e.g., Kokkos testbed


Evolving Computing Architectures

• GeantX, ECP pilot; we have:
  – ORNL: 1 PD starting in Dec. Total 1.5 including 0.5 of a computing professional
  – LBNL: 0.25 FTE of Jonathan who is a PD
  – FNAL: 0.4 FTE of Philippe Canal
  – Pittsburgh: a collaborator Joe Boudreaux
• Center for Computational Excellence
  – Office of High Energy Physics Program with Fermilab, ANL, LBNL, and BNL
  – In proposal
  – For running applications for DUNE, LHC and Cosmic on the DOE Leadership Class Facilities (Argonne and Oak Ridge)
  – Thrusts (all below for HEP on High Performance Computing)
    • Portable parallelization strategies
    • Fine-grained I/O and storage (includes data models and structures)
    • Event generators
    • Complex workflows
  – Details elsewhere
• CMSSW - offloading GPU work with framework
  – Part of existing CMSSW


Compute

• HEPCloud
  - Make common cause with OSG for heterogeneous scheduling
  - Concentration on support for leadership computing facilities starting with ALCF
  - Evolution of our facility, especially with Rucio
  - Application integration (CMS)
  - First production version working
    • Developed under CompHEP funding
  - Next-epoch development underway
    • 0.85 FTE


Storage

• Software strategy involves Rucio
  – Rucio for CMS project - transition to Rucio for storage management
    • 1.25 FTE
      – Purdue: 0.3 FTE; UK: 0.5 FTE; Italy: 0.25 FTE
  – Rucio for DUNE and IF projects
    • Goal is to use Rucio for storage management and replace our current IF file and metadata catalogs
    • Integrate Rucio into DUNE workflows
    • 0.85 FTE, to expand to 4 FTE
• Looking toward future of mass storage architecture
  – Expect to move beyond Enstore (current storage system)
  – Exploring multi-lab tens-of-exabytes proposal
  – Also exploring collaboration with CERN on CTA


Networking

• BigData Express
  – Developed with DOE ASCR networking funds
• Ongoing work with ESnet and OSG
• FNAL


Analysis Facility Concepts

• How would one do analysis on an Exascale machine?
• Closer to home:
  – Deploying analysis nodes with Kubernetes (CMS)
  – Coffea (HEP analysis with Python tools - originated at CMS)
  – SciDAC HEP Analytics (HEP analysis at HPC with Python/C++)
    • Effort listed earlier


Summary

• R&D Strategy aligned with Lab and DOE
  - Heavy emphasis on AI and Evolving Computing Architectures
• Wide variety of funding under various programs
  - Internal LDRDs
  - OHEP (CompHEP, CCE)
  - OHEP & ASCR (SciDAC, Exa.Trx)
  - CMS
• Leveraging external work (e.g. Kokkos)
• Leveraging internal work (e.g. Synergia)


Backup


Fermilab R&D Activities Overview

• Physics and detector simulations with advanced architectures and techniques
• Accelerator Modeling on HPC
• Evolution of Infrastructure Frameworks (CMS, DUNE) and ROOT
• HPC, Advanced architectures/accelerators, multithreading
  – Containerization
  – HEP Data Analytics
  – Reconstruction
  – Spack & SpackDev [HPC compatible packaging]
• Machine Intelligence
• Data Acquisition
• Advanced networking (BigData Express)
• Workflow (HEPCloud)
• Astro (CCD/MKIDs)
• QIS now has its own program and I won’t discuss it, but some personnel come from SCD (myself included)

• Funding comes from many sources
• DOE-OHEP
• USCMS Software and Computing (S&C) Operations Program
• SciDAC-4 [DOE-ASCR] $17.5M awarded total
  – 5 yr and 3 yr projects started in FY18
• Fermilab LDRD (Lab Directed R&D)
• Exascale Computing Project (ECP)
• HEP-CCE (Center for Computational Excellence)
  – Promote excellence in HPC and R&D
  – Enhance connection to ASCR
  – FNAL, ANL, BNL, LBNL
• Other experiment projects & Detector R&D (KA25)
  – e.g. CMS Outer Tracker, Mu2e TDAQ
• We supplement with SCD funds
• Personnel may be matrixed across projects