Sequoia and the Petascale Era

Lawrence Livermore National Laboratory

Thomas Spelce
Development Environment Group

LLNL-PRES-411030

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344


SCICOMP 15, May 20, 2009


Lawrence Livermore National Laboratory | 10th International LCI Conference | LLNL-PRES-411030

The Advanced Simulation and Computing (ASC) Program delivers high confidence prediction of weapons behavior

Integrated Codes

Physics and Engineering Models

Verification and Validation

Codes to predict safety and reliability

Models and understanding

NNSA Science Campaigns

Experiments

Legacy UGTs

Experiments provide critical validation data

ASC integrates all of the science and engineering that makes stewardship successful



ASC pursued three classes of systems to cost effectively meet current (and anticipate future) compute requirements

Capability systems ==> the most challenging integrated design calculations
• More costly but proven
• Production workload

Capacity systems ==> day to day work
• Less costly, somewhat less reliable
• Throughput for less demanding problems

Advanced Architectures ==> performance, power consumption, etc.
• Targeted but demanding workload
• Tomorrow’s mainstream solutions?

The “three curves” (Capability, Capacity and Advanced Architectures) approach has been successful in delivering good cost performance across the spectrum of need…

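[Figure: the three curves plotted as Performance versus Time, with FY01 and FY05 marked. Systems shown: Red, Blue, White, Q, Purple, MCR, Thunder, Peloton, TLCC (Juno), BlueGene/L, Roadrunner, Sequoia. Curve annotations: “Mainframes (RIP)”; “Original concept: develop capability”; “Low-cost capacity”; “Higher performance and lower power consumption”.]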



Sequoia represents largest increase in computational power ever delivered for NNSA Stockpile Stewardship

[Figure: Sequoia procurement and deployment timeline, 1/06 through 12/12. Milestones shown: Market Survey; CD0 Approved; Write RFP; Vendor Response; CD1 Approved; Selection; Contract Package; Sequoia Plan Review; CD2/3 Approved; Sequoia contract award; Dawn LA; Dawn Phase 1; Dawn Phase 2; Dawn system acceptance; Dawn Early Science; Transition to Classified; Dawn GA; Sequoia Build Decision; Sequoia Demo; Sequoia Parts Commit & Option; Sequoia Parts Build; Phased System Deliveries; Sequoia final system acceptance; Sequoia Early Science; Transition to Classified; Sequoia Operational Readiness; CD4 Approved. Sequoia has a five-year planned lifetime through CY17.]



“Dawn speeds a man on his journey, and speeds him too in his work” ...Hesiod (~700 B.C.E)

Dawn Specifications
• IBM BG/P architecture
• 36,864 compute nodes (500TF)
• 147,456 PPC 450 cores
• 4GB memory per node (147.5TB)
• 128-to-1 compute to I/O node ratio
• 288 10GE links to file system

Dawn Installation
• Feb 27th - final rack delivery
• March 5th - 36 Rack integration complete
• March 15-24th - Synthetic WorkLoad start
• End of March - Acceptance (planned)

ibm.com/systems/deepcomputing/bluegene/



DAWN SEQUOIA Initial Delivery

Chip: 850 MHz PPC 450, 4 cores/4 threads, 13.6 GF/s Peak, 8 MB EDRAM

Compute Card: 13.6 GF/s, 4.0 GB DDR2, 13.6 GB/s Memory BW, 0.75 GB/s 3D Torus BW

Node Card: 435 GF/s, 128 GB

Rack: 14 TF/s, 4 TB, 36 KW

System: 36 racks, 0.5 PF/s, 144 TB, 1.3 MW, >8 Day MTBF
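The Dawn packaging figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. The flops-per-cycle value and the 32-way fan-outs are not stated on the slides; they are assumptions inferred from the quoted ratios (13.6 GF/s per 4-core 850 MHz chip, 128 GB / 4 GB, 4 TB / 128 GB).

```python
# Roll-up of Dawn (BG/P) peak rate and memory from chip to system.
CLOCK_GHZ = 0.850           # from the slide: 850 MHz PPC 450
CORES_PER_CHIP = 4          # from the slide
FLOPS_PER_CYCLE = 4         # assumption: 13.6 GF/s / (0.85 GHz * 4 cores)
CARDS_PER_NODE_CARD = 32    # assumption: 128 GB / 4 GB per compute card
NODE_CARDS_PER_RACK = 32    # assumption: 4 TB / 128 GB per node card
RACKS = 36                  # from the slide
GB_PER_NODE = 4             # from the slide
COMPUTE_TO_IO_RATIO = 128   # from the slide

chip_gf = CLOCK_GHZ * CORES_PER_CHIP * FLOPS_PER_CYCLE
node_card_gf = chip_gf * CARDS_PER_NODE_CARD
rack_tf = node_card_gf * NODE_CARDS_PER_RACK / 1000
system_pf = rack_tf * RACKS / 1000

nodes = CARDS_PER_NODE_CARD * NODE_CARDS_PER_RACK * RACKS
cores = nodes * CORES_PER_CHIP
io_nodes = nodes // COMPUTE_TO_IO_RATIO
memory_binary_tb = nodes * GB_PER_NODE / 1024   # 1 TB counted as 1024 GB

print(f"chip {chip_gf:.1f} GF/s, node card {node_card_gf:.1f} GF/s")
print(f"rack {rack_tf:.2f} TF/s, system {system_pf:.3f} PF/s")
print(f"{nodes} nodes, {cores} cores, {io_nodes} I/O nodes")
print(f"memory {memory_binary_tb:.0f} TB")
```

Under these assumptions the node and core counts agree with the Dawn specifications, the I/O node count equals the number of 10GE file system links, and the 144 TB system memory figure corresponds to counting 1024 GB per TB.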



DAWN Initial Delivery Infrastructure

[Figure: network diagram of the Dawn core (9 x 4 BG/P racks), the E-net core, primary and backup SERVICE nodes, a third SERVICE node, LOGIN, HTC, HMC, local disk, and the LLNL network. Link labels shown: 288 x 10GbE; 144 x 1GbE; 14 x 1GbE (twice); 2 x 1GbE (three times); 3 x 1GbE; 1 x 10GbE; 4 x 4 1GbE; 8 x 4 10GbE; 12 x 10GbE; 2 x 10GbE (three times); 2 x FC4 (four times).]



Sequoia Target Architecture and Infrastructure

Production Operation FY12-FY17
• 20PF/s, 1.6 PB Memory
• 96 racks, 98,304 nodes
• 1.6 M cores (1 GB/core)
• 50 PB Lustre file system
• 6.0 MW power (160 times more efficient than Purple)

Will be used as a 2D ultra-res and 3D high-res Uncertainty Quantification (UQ) engine

Will be used for 3D science capability runs exploring key materials science problems
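The Sequoia target figures imply several per-node and per-watt quantities that are not written on the slide. The sketch below derives them. The 16 cores per node is an assumption obtained by rounding 1.6 M cores / 98,304 nodes, and the Purple efficiency is only what the "160 times" claim implies, not a measured value.

```python
# Derived quantities for the Sequoia target architecture.
PEAK_PF = 20.0            # from the slide
RACKS = 96                # from the slide
NODES = 98_304            # from the slide
POWER_MW = 6.0            # from the slide
GB_PER_CORE = 1           # from the slide
EFFICIENCY_RATIO = 160    # from the slide: "160 times more efficient than Purple"
CORES_PER_NODE = 16       # assumption: 1.6 M cores / 98,304 nodes, rounded

nodes_per_rack = NODES // RACKS
cores = NODES * CORES_PER_NODE
total_memory_pb = cores * GB_PER_CORE / 1e6          # 1 PB counted as 1e6 GB
gf_per_watt = (PEAK_PF * 1e6) / (POWER_MW * 1e6)     # GF/s divided by W
implied_purple_gf_per_watt = gf_per_watt / EFFICIENCY_RATIO

print(f"{nodes_per_rack} nodes per rack, {cores:,} cores")
print(f"memory {total_memory_pb:.2f} PB")
print(f"Sequoia {gf_per_watt:.2f} GF/s per W")
print(f"Purple (implied) {implied_purple_gf_per_watt * 1000:.1f} MF/s per W")
```

The rack density comes out the same as Dawn's, and the rounded core count and memory total are consistent with the "1.6 M cores" and "1.6 PB" figures.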



High performance material science simulations will contribute directly to ASC programmatic success

Six physics/materials science applications targeted for early implementation on Sequoia infrastructure
• Qbox – Quantum molecular dynamics for determination of material equation of state
• DDCMD – Molecular dynamics for material dynamics
• Miranda – 3D Continuum fluid dynamics for interfacial mixing
• ALE3D – 3D Continuum mechanics for ignition and detonation propagation of explosives
• LAMMPS – Molecular dynamics for shock initiation in high explosives
• ParaDiS – Dislocation dynamics for high pressure strength in materials



Single Sequoia Platform Mandatory Requirement is P ≥ 20
• P is “peak” of the machine measured in petaFLOP/s

Target requirement is P + S ≥ 40
• S is weighted average of five “marquee” benchmark codes
• Four code package benchmarks
  − UMT, IRS, AMG, and SPhot
  − Program goal is 24x the Purple capability throughput
• One “science workload” benchmark from SNL
  − LAMMPS (molecular dynamics)
  − Program goal is 20x-50x BGL for science capability

Purple – 100TF/s
BlueGene/L – 367TF/s
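The mandatory and target requirements can be expressed as a small check. This is a minimal sketch, not the procurement's actual scoring procedure: the slides do not give the benchmark weights or how each benchmark score is normalized, so the equal weights and the per-code scores below are hypothetical placeholders, assumed to be on the same scale as P.

```python
# Sketch of the "P >= 20" and "P + S >= 40" checks.
# Weights and scores are hypothetical; only the thresholds come from the slide.
MANDATORY_PEAK = 20.0   # P >= 20 (petaFLOP/s)
TARGET_SUM = 40.0       # P + S >= 40

def weighted_average(scores, weights):
    """Weighted average of benchmark scores keyed by benchmark name."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in weights) / total_weight

def evaluate(peak_pf, scores, weights):
    """Return S and whether the mandatory and target requirements are met."""
    s = weighted_average(scores, weights)
    return {
        "S": s,
        "mandatory_met": peak_pf >= MANDATORY_PEAK,
        "target_met": peak_pf + s >= TARGET_SUM,
    }

# Hypothetical inputs: equal weights over the five marquee codes.
example_weights = {"UMT": 1, "IRS": 1, "AMG": 1, "SPhot": 1, "LAMMPS": 1}
example_scores = {"UMT": 22.0, "IRS": 18.0, "AMG": 20.0, "SPhot": 21.0, "LAMMPS": 19.0}
example_result = evaluate(20.0, example_scores, example_weights)
print(example_result)
```

Separating the hard floor on P from the softer combined target mirrors the slide: a bid can satisfy the peak requirement and still miss the target if sustained benchmark performance is weak.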



Sequoia Operating System Perspective

Light weight kernel on compute nodes
• Optimized for scalability and reliability
• As simple as possible
• Extremely low OS noise
• Direct access to interconnect hardware

OS features
• Linux/Unix syscall compatible w/ I/O syscalls
• Support for dynamic lib runtime loading
• Shared memory regions
• Open source

Linux/Unix OS on I/O nodes
• Leverage large Linux/Unix base & community
• Enhance TCP offload, PCIe, I/O
• Standard File Systems - Lustre, NFSv4, etc.
• Aggregates N CN for I/O & admin
• Open source

[Figure: software stacks for compute nodes 1 to N and one I/O node. Sequoia CN and Interconnect (repeated per compute node): Application; GLIBC with NPTL Posix threads, OpenMP, SE/TM and glibc dynamic loading; MPI over ADI over hardware transport; RAS; futex; shared memory; SMP; function shipped syscalls. Sequoia ION and Interconnect (Linux/Unix): FSD, perf tools, TotalView, Lustre Client, NFSv4, SLURMD; LNet, UDP, TCP/IP; function shipped syscalls.]



Sequoia Software Stack – Applications Perspective

[Figure: layered software stack split into User Space and Kernel Space. Components shown: APPLICATION; Code Development Tools; C/C++/Fortran Compilers, Python; Optimized Math Libs; Parallel Math Libs; Clib/F03 runtime; MPI2; OpenMP, Threads, SE/TM; SOCKETS; Lustre Client; ADI; Interconnect Interface; TCP, UDP, IP; LNet; Function Shipped syscalls; External Network. Side labels: LWK, Linux/Unix; SLURM/Moab; RAS, Control System; Code Dev Tools Infrastructure.]



The tools that users know and love will be available on Sequoia with improvements and additions as needed

[Figure: tools arranged in columns (Infrastructure, Debugging, Performance, Features) against Operational Scale (1 to 10^7), with existing and new tools distinguished. Tools shown: Dyninst, PAPI, StackWalker, OpenMP Profiling Interface, MRNet, PMPI, APAI, DPCL, Valgrind, OTF, SE/TM Monitor, LaunchMON, STAT, TV memlight, memP, TotalView, ThreadCheck, MemCheck, SE/TM Debugger, New Lightweight Focus Tools, mpiP, TAU, O|SS, OpenMP Analyzer, gprof, SE/TM Analyzer.]



Application programming requirements and challenges

Availability of 1.6M cores pushes all-MPI codes to extreme concurrency

Availability of many threads on many SMP cores encourages low-level parallelism for higher performance

Mixed MPI/SMP programming environment and possibility of heterogeneous compute distribution brings load imbalance to the fore

I/O and visualization requirements encourage innovative strategies to minimize memory and bandwidth bottlenecks

MPI Scaling

SMP Threads

I/O & Visualization

Hybrid Models
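Load imbalance, raised above as a consequence of mixed MPI/SMP programming, is commonly quantified from per-task times in a bulk-synchronous step, where every task waits for the slowest one. The sketch below is a generic illustration with made-up timings. The function names are hypothetical and do not refer to any tool named in this talk.

```python
# Generic load-imbalance metrics for one bulk-synchronous step.
from statistics import mean

def imbalance_factor(task_times):
    """max/mean - 1: zero for perfect balance, larger when one task lags."""
    return max(task_times) / mean(task_times) - 1.0

def parallel_efficiency(task_times):
    """Fraction of the step spent on useful work, averaged over tasks."""
    return mean(task_times) / max(task_times)

# Hypothetical timings (seconds) for four tasks; one task has twice the work.
example_times = [1.0, 1.0, 1.0, 2.0]
example_imbalance = imbalance_factor(example_times)
example_efficiency = parallel_efficiency(example_times)
print(f"imbalance {example_imbalance:.2f}, efficiency {example_efficiency:.3f}")
```

Because the step time is set by the maximum, a single slow task among 1.6M costs as much wall time as if every task were slow, which is why the metric uses max rather than variance.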




The RFP asked interested vendors to address a “Unified Nested Node Concurrency” model

MPI Tasks on a node are processes (one is shown) with multiple OS threads (Thread0-3 shown)

Thread0 is “Main thread” & Thread1-3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler generated threads via runtime support

Hardware support to significantly reduce overheads for thread repurposing and OpenMP loops and locks

[Figure: timeline of one MPI task with Thread0 through Thread3, running from MPI_INIT to MPI_FINALIZE and Exit. Thread0 runs MAIN, which calls Funct1 (OpenMP), which calls Funct2 (TM/SE). MPI calls appear between the parallel regions, and helper threads 1-3 join each OpenMP and TM/SE region.]

1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads based MAIN calls OpenMP based Funct1
4) OpenMP Funct1 calls TM/SE based Funct2
5) Funct2 returns to OpenMP based Funct1
6) Funct1 returns to Pthreads based MAIN
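The call sequence above can be mimicked in a few lines to show the control structure: one main thread, three persistent helpers, and nested regions that reuse the same helpers under different roles. This is only a structural sketch. Python threads stand in for Pthreads, OpenMP workers and TM/SE threads, the region names are labels rather than real runtimes, and no parallel speedup is implied.

```python
# Structural sketch of nested node concurrency with persistent helper threads.
from concurrent.futures import ThreadPoolExecutor

HELPERS = 3  # Thread1-3 on the slide; the caller plays Thread0

def parallel_region(pool, kind, work, items):
    """Thread0 hands items to the helpers and waits for all results.

    `kind` is only a label for the role the helpers play in this region.
    """
    def run(item):
        return kind, work(item)
    return [result for _, result in pool.map(run, items)]

def funct2(pool, n):
    """Stand-in for the TM/SE based function: sum of squares below n."""
    return sum(parallel_region(pool, "tm_se", lambda v: v * v, range(n)))

def funct1(pool, n):
    """Stand-in for the OpenMP based function; it calls funct2 from Thread0."""
    loop_results = parallel_region(pool, "openmp", lambda v: v + 1, range(n))
    nested = funct2(pool, n)  # step 4: only Thread0 nests parallelism
    return sum(loop_results) + nested  # steps 5 and 6: results flow back up

def main():
    # Step 1: the helpers are created once, alongside MAIN, and then reused.
    with ThreadPoolExecutor(max_workers=HELPERS, thread_name_prefix="helper") as pool:
        return funct1(pool, 4)

example_total = main()
print(example_total)
```

Keeping every nested call on the main thread means helpers never wait on work submitted to their own pool, which is the property that rule 2 on the slide protects.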




Previous systems have prepared the way for Sequoia

BG/L experience informs Dawn/Sequoia scalability

OpenMP & Posix threads experience on Linux/AIX

Integrated codes regularly run at Purple capability

Dawn will be used for code development
• SMP parallelism
• Python
• Larger memory per core than BG/L
• Some critical UQ analysis as well

Sequoia will be a Tri-Lab ASC resource
• Video conferences for coordination

DAWN Initial Delivery



A diverse team and a new Scalable Application Preparation Project ensure success on Sequoia

LC Hotline, User Training and Documentation address routine issues

ADEPT team provides expertise in compilers, debuggers, performance tools

Access to IBM experts, including an on-site IBM applications analyst

Staff to work closely with the application teams

Ongoing ANL/IBM/LLNL BlueGene collaboration

Engaging third-party vendors, university research partners, and the open source community



New Petascale Computing Enabling Technologies (PCET) LDRD is addressing key barriers to predictive simulation

[Figure: barriers encountered as core counts grow from 10^3 to 10^8, with Purple, BG/L, Petascale and Exascale marked along the scale. Barriers shown, in order: Debugging; Load Balance; Fault Tolerance; Multicore; Vector FP Units/Accelerators?; Power?]

PCET creates essential capabilities for exascale core counts



PCET strategy mitigates risk to assure immediate impact on application drivers and longer term success

Current capabilities
• MPI large grain parallelism
• Basic checkpoint/restart
• Ill-defined load imbalances
• Debugging < 4096 cores

Terascale capabilities
• Multicore-adapted algorithms
• Faster checkpoint/restart
• Understood load imbalances
• Targeted debugging

Petascale capable & Exascale prepared
• Multicore-aware algorithms
• Application-level fault tolerance
• Well-balanced application load
• Automated error analysis

Shorter Term Payoff
• Load balance analysis
• Cache oblivious data layouts
• Checkpoint compression
• Behavioral and performance equivalence classes



Take-away: Computational science on Sequoia at full scale will be the culmination of many years of hard work

[Figure: process diagram with the stages Innovative or evolutionary architecture ideas; R&D contracts; Flexible contracts with targets as requirements; Milestone progress; Periodic reviews; Rigorous review; Initial delivery & integration; Computational science R&D. Marker: “We’re here with Dawn ID”.]
