Computational Research Division
Scientific Application Performance on Candidate PetaScale Applications

Leonid Oliker, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski,
Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier - Lawrence Berkeley National Laboratory
Stephane Ethier - Princeton Plasma Physics Laboratory
Tom Goodale - Cardiff University and Louisiana State University

Winner, Best Paper Application Track, IPDPS '07, Long Beach, CA, March 26-30, 2007
Overview
Stagnating application performance is a well-known problem in scientific computing
By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels
Many HEC platforms are poorly balanced for the demands of leading applications:
Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
Traditional superscalar trends are slowing down:
Most benefits of ILP and pipelining have already been mined
Clock frequency is limited by power concerns
Declining Single Processor Performance
Moore's Law
Silicon lithography will improve by 2x every 18 months
Double the number of transistors per chip every 18mo.
CMOS Power
Total Power = V^2 * f * C (active power) + V * I_leakage (passive power)
As we reduce feature size, capacitance (C) decreases proportionally to transistor size
Enables increase of clock frequency ( f ) proportionally to Moore’s law lithography improvements, with same power use
This is called “Fixed Voltage Clock Frequency Scaling” (Borkar `99)
Since ~90nm, V^2 * f * C ≈ V * I_leakage
Can no longer take advantage of frequency scaling because passive power (V * I_leakage) dominates (see the sketch below)
Result is the recent clock-frequency stall, reflected in the Patterson graph at right
Multicore is here
[Chart: SPEC_Int benchmark performance since 1978, from Patterson & Hennessy Vol. 4]
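To make the fixed-voltage-scaling argument concrete, here is a minimal Python sketch of the power model above. Every parameter value is a hypothetical placeholder (not a measurement), and the per-generation leakage growth is an assumed trend chosen only to illustrate passive power overtaking active power.

```python
def total_power(V, f, C, I_leak):
    """Total Power = V^2 * f * C (active) + V * I_leak (passive)."""
    active = V ** 2 * f * C
    passive = V * I_leak
    return active, passive

# Hypothetical starting values, for illustration only
V, f, C, I_leak = 1.2, 2.0e9, 1.0e-9, 0.5
for gen in range(4):
    active, passive = total_power(V, f, C, I_leak)
    print(f"gen {gen}: active = {active:5.2f} W, passive = {passive:5.2f} W")
    f *= 2.0       # lithography lets clock frequency rise at fixed voltage
    C /= 2.0       # capacitance shrinks with transistor size, so V^2*f*C stays flat
    I_leak *= 2.0  # assumed trend: leakage grows as features shrink
```

Under these assumptions the active term stays constant while the passive term doubles each generation, which is the point of the slide: frequency scaling at fixed voltage stops paying once leakage dominates.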
Application Evaluation
Microbenchmarks, algorithmic kernels, and performance modeling and prediction are important components of understanding and improving architectural efficiency
However, full-scale application performance is the final arbiter of system utility and is necessary as a baseline to support all complementary approaches
Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
Requires coordination of computer scientists and application experts from highly diverse backgrounds
Our initial efforts have focused on comparing performance between high-end vector and scalar platforms
Currently evaluating ultra-scale systems (soon to be petascale)
NERSC 100TF/s in FY07, 500 TF/s in FY10 (XT4)
ORNL 250TF/s in FY07, 1000TF/s in FY08/09 (XT4)
ANL 100TF/s in FY07, 500TF/s in FY08/09 (BG/P)
SNL ~1000TF/s in FY 08 (XT4)
A hypothetical system sustaining 1 PF/s (2011) will contain 1.5M-6.5M processors! It is important to understand which algorithmic aspects allow or prevent petascale computation
Benefits of Evaluation
Full-scale application evaluation leads to more efficient use of community resources
For both current installations and future designs
Head-to-head comparisons on full applications:
Help identify the suitability of a particular architecture for a given application class
Give application scientists information about how well various numerical methods perform across systems
Reveal performance-limiting system bottlenecks that can aid designers of the next generation of systems
Science-Driven Architecture: in-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale to achieve high performance
Application Overview
NAME       Discipline           Problem/Method                    Structure
GTC        Magnetic Fusion      Particle-in-Cell, Vlasov-Poisson  Particle/Grid
ELB3D      Fluid Dynamics       Lattice Boltzmann, Navier-Stokes  Lattice/Grid
CACTUS     Astrophysics         Theory of GR, ADM-BSSN            Grid
BB3D       High Energy Physics  Particle-in-Cell, FFT             Particle/Grid
PARATEC    Materials Science    Density Functional Theory, FFT    Fourier/Grid
HyperClaw  Gas Dynamics         Hyperbolic, High-order Godunov    Grid AMR
Examining set of applications with a variety of numerical methods and communication patterns with the potential to run at petascale
Architectural Comparison
Name      Node Type  Where  Network     Network    Total   CPU/  Clock  Peak       Stream BW  Stream     MPI BW    MPI Latency
                                        Topology   Procs   Node  (GHz)  (GFlop/s)  (GB/s/P)   byte/flop  (GB/s/P)  (μsec)
Bassi     Power5     NERSC  Federation  Fat-tree      888     8    1.9     7.6        6.8        0.85       0.69       4.7
Jaguar    Opteron    ORNL   XT3         3D-Torus   10,404     2    2.6     5.2        2.5        0.48       1.2        5.5
Jacquard  Opteron    NERSC  InfiniBand  Fat-tree      640     2    2.2     4.4        2.3        0.51       0.7        5.2
BG/L      PPC440     ANL    Custom      3D-Torus    2,048     2    0.7     2.8        0.9        0.31       0.16       2.2
BGW       PPC440     TJW    Custom      3D-Torus   40,960     2    0.7     2.8        0.9        0.31       0.16       2.2
Phoenix   X1E        ORNL   Custom      4D-Hcube      768     8    1.1    18.0        9.7        0.54       2.9        5.0
These architectures represent mainstream petaflop candidate systems
Bassi's Power5 aggregates 4 DDR233 channels for its high 6.8 GB/s Stream bandwidth
Jaguar XT3: dual-core Opteron with SeaStar routing (Catamount) - basis of future petascale systems
Jacquard: single-core Opteron with non-custom (InfiniBand) communication integration
BG/L: power efficient, 5 networks, dual core without L1 coherence, 512 MB/node, 2 SIMD FPUs
Brief access to the 20K-node BG/L (BGW) during BGW day
Phoenix X1E: custom HPC platform, X1 upgrade (double the MSPs without a bandwidth increase)
Cost is a critical metric - however we are unable to provide such data
Proprietary pricing varies based on customer and time frame
Poorly balanced systems cannot solve important problems/resolutions regardless of cost!
Astrophysics: CACTUS
Numerical solution of Einstein’s equations from theory of general relativity
Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
Evolves PDEs on a regular grid using finite differences (update pattern sketched below)
Visualization of grazing collision of two black holes
Developed at Max Planck Institute, vectorized/ported by John Shalf & Tom Goodale
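For readers unfamiliar with the computational pattern, the following is a minimal Python sketch of an explicit finite-difference update on a regular 3D grid. It evolves a toy diffusion-like field, not the coupled ADM-BSSN system Cactus actually solves, but the per-point stencil work and nearest-neighbor structure are the same kind of operation.

```python
import numpy as np

def evolve(u, dt=0.1, dx=1.0, steps=10):
    """Explicit finite-difference update of a toy diffusion field on a 3D grid."""
    for _ in range(steps):
        lap = (-6.0 * u
               + np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0)
               + np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1)
               + np.roll(u, 1, axis=2) + np.roll(u, -1, axis=2)) / dx ** 2
        u = u + dt * lap   # 7-point stencil; in an MPI run, ghost zones are exchanged each step
    return u

u_final = evolve(np.random.rand(32, 32, 32))
```

In the real code the grid is block-distributed, so each step also requires a nearest-neighbor ghost-zone exchange, which is what the weak-scaling results below exercise.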
Cactus Performance: Weak Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (16-16384) for Bassi, Jacquard, BG/L, Phoenix]
Bassi shows the highest raw and sustained performance
Remains to be seen whether scalability will continue to thousands of processors
Jacquard shows modest scaling due to the (relatively) loosely coupled nature of its interconnect
Phoenix X1 shows the lowest % of peak - Cactus crashed on the X1E
A small fraction of unvectorizable code (boundary conditions) leads to a large penalty: only 1 of 4 SSPs within an MSP is used for the unvectorizable code portion
BG/L has the lowest raw performance - expected for its simple (power efficient) dual-issue in-order PPC440
Achieves near-perfect scalability to the highest concurrency ever attained
Virtual node mode (32K procs) showed no performance degradation (smaller problem)
Topology mapping attempts did not improve performance
Overall shows the potential of Cactus to run at petascale
Fluid Dynamics: ELB3D
LBM concept: develop simplified kinetic model with incorporated physics, to reproduce correct macroscopic averaged properties
Unlike explicit LBM methods, entropic approach not prone to nonlinear instabilities
ELBM developed to simulate Navier-Stokes turbulence
A nonlinear equation is solved at each grid point and time step to satisfy constraints, followed by streaming of data
The equation is solved via a Newton-Raphson iteration (heavy use of the log function)
Spatial grid is overlaid with a phase-space velocity lattice
Block distributed over the processor grid
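As an illustration of the per-grid-point solve, here is a minimal Python sketch of a Newton-Raphson iteration on a log-heavy scalar equation. The function g below is a hypothetical stand-in, not the actual entropic LBM constraint used in ELB3D.

```python
import math

def newton(g, dg, x0, tol=1e-12, max_iter=50):
    """Scalar Newton-Raphson: iterate x <- x - g(x)/dg(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical log-heavy constraint g(a) = 0, standing in for the entropic LBM equation
g  = lambda a: a * math.log(1.0 + a) - 0.5
dg = lambda a: math.log(1.0 + a) + a / (1.0 + a)
alpha = newton(g, dg, x0=1.0)
```

Because one such solve runs at every grid point and time step, the cost and vectorizability of this inner iteration (and of the log evaluations inside it) dominate the method's performance.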
Developed by George Vahala's group at the College of William & Mary, ported by Jonathan Carter
Evolution of vorticity into turbulent structures
[Diagram: phase-space velocity lattice with numbered discrete lattice velocities]
ELB3D Performance: Strong Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (64-1024) for Bassi, Jacquard, Jaguar, BG/L, Phoenix]
Specialized log functions (MASSV, ACML) used on all platforms, with significant improvement
Shows a high % of peak and good scaling across all platforms - good load balance
Results show that the high computational cost of the entropic algorithm can be handled efficiently
X1E attains the highest raw performance: the innermost gridpoint loop is taken inside the nonlinear equation to allow for vectorization
Bassi shows the highest fraction of peak: high memory BW, large caches, advanced prefetch
BG/L shows the lowest efficiency, but the value is based on peak with the double hummer
Except for highly tuned libraries, achievable performance will almost always be limited to 50% of this peak
Overall, results show promising performance for ELB methods at petascale
LBMHD Cell Performance
[Bar chart: GFLOP/s for the Cell processor (16 SPEs, 8 SPEs, 1 PPE at 3.2 GHz), vector processors (SX8 2.0 GHz, X1E 1.13 GHz, Earth Simulator 1 GHz), and scalar processors (Power5 1.9 GHz, Opteron 2.2 GHz, Itanium2 1.4 GHz, BlueGene/L 0.7 GHz)]
Multi-core scientific kernel optimization work by Samuel Williams (UCB/LBNL)
Cell achieves impressive performance for the MHD problem (explicit LB method)
Ability to explicitly control local memory (more programming complexity)
Results shown only for the collision phase (>>85% of overall time; the basic relaxation step is sketched below)
Multi-blade (MPI) Cell results to come
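For context on what a collision phase does, here is a minimal Python sketch of a generic BGK-style relaxation step. LBMHD's actual collision operator also updates a magnetic-field distribution and is far more arithmetic-intensive, so this only shows the data-parallel structure, not the real kernel.

```python
import numpy as np

def collide(f, f_eq, tau=0.6):
    """BGK relaxation: nudge each distribution toward its local equilibrium."""
    return f - (f - f_eq) / tau

f    = np.random.rand(19, 16, 16, 16)   # e.g. a D3Q19 velocity set over a 16^3 block
f_eq = np.random.rand(19, 16, 16, 16)   # placeholder equilibrium distributions
f = collide(f, f_eq)
```

Every lattice site is updated independently, which is why this phase maps so well onto the Cell SPEs once the working set is staged into local memory.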
Magnetic Fusion: GTC
Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
GTC solves 3D gyroaveraged gyrokinetic system w/ particle-in-cell approach (PIC)
PIC scales as N instead of N^2 - particles interact with the electromagnetic field on a grid
Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)
Vectorization inhibited since multiple particles may attempt to concurrently update the same grid point (deposition scatter sketched below)
Whole volume and cross section of electrostatic potential field, showing elongated turbulence eddies
Developed at PPPL, vectorized/optimized by Stephane Ethier
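The scatter hazard mentioned above can be seen in a minimal Python sketch of particle-to-grid charge deposition, shown here in 1D with nearest-cell weighting for brevity; GTC itself deposits gyroaveraged charge onto a 3D toroidal grid, so this is only the access pattern, not the real algorithm.

```python
import numpy as np

def deposit(positions, charges, n_cells, length=1.0):
    """Scatter particle charge onto a 1D grid (nearest-cell deposition)."""
    rho = np.zeros(n_cells)
    dx = length / n_cells
    for x, q in zip(positions, charges):   # loop-carried: rho[i] is a read-modify-write
        i = int(x / dx) % n_cells
        rho[i] += q                        # two particles landing in one cell collide here
    return rho

rho = deposit(np.random.rand(1000), np.ones(1000), n_cells=64)
```

The += into shared grid cells is the dependence that blocks straightforward vectorization; vector and multi-core versions typically use private copies of the grid or colored/sorted particle lists to avoid the conflict.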
GTC Performance: Weak Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (64-32K) for Bassi, Jacquard, Jaguar, BG/L, Phoenix]
Generally expect a low % of peak due to scatter/gather
Extensive X1E optimization (such as reversing array dimensions) paid off, although performance drops at higher concurrency
The three commodity superscalar systems (Power5, Opteron) achieve similar raw performance
However Bassi attains only 1/2 the % of peak of the Opterons
The high Opteron efficiency is partially due to low main memory latency
Bassi, Jaguar, and BG/L show excellent scalability
BG/L scales to 32K processors: virtual node mode is only 5% slower than coprocessor mode
Optimizations: vector MASSV functions (60%) and mapping onto the 3D torus (30%)
Results show GTC is a prime candidate for petascale systems: perfect load balancing up to 32K processors and low communication
High Energy Physics: BeamBeam3D
BB3D models collisions of counter-rotating charged particle beams
Code used to study beam-beam collisions at world’s highest energy accelerators
Particle-in-cell method, where particles are deposited on 3D grid to calculate charge density distribution
At collision points, electric/magnetic fields are calculated using Vlasov-Poisson via FFT (field-solve pattern sketched below)
Particles are advanced using the computed fields and accelerator forces
High communication requirements: global gather of charge density, broadcast of electric/magnetic fields, global FFT transpose
Developed by Ji Qiang, ported/optimized by Hongzhang Shan
GTC and BB3D are both PIC codes with charged-particle interactions, however:
GTC requires relatively small particle movements, allowing a local solve of the Poisson equation
BB3D has longer-range particle forces, requiring high communication and global FFTs
Limited number of subdomains available (2048 for this test case) - pure domain decomposition not possible
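As a sketch of the FFT-based field solve, the following Python snippet solves a fully periodic Poisson problem in Fourier space. BeamBeam3D's actual solver handles beam-specific boundary conditions and Green functions, so this only illustrates the transform-divide-transform pattern whose global transposes dominate communication.

```python
import numpy as np

def poisson_fft(rho, length=1.0):
    """Solve laplacian(phi) = rho on a periodic cube via FFT (up to sign/units conventions)."""
    n = rho.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                      # avoid divide-by-zero for the mean mode
    phi_hat = -np.fft.fftn(rho) / k2
    phi_hat[0, 0, 0] = 0.0                 # fix the arbitrary constant (zero-mean potential)
    return np.real(np.fft.ifftn(phi_hat))

phi = poisson_fft(np.random.rand(32, 32, 32))
```

In the distributed code the forward and inverse 3D FFTs require global transposes of the grid, which is why the field solve, unlike GTC's local solve, is communication-bound.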
BeamBeam3D Performance: Strong Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (64-2048) for Bassi, Jacquard, Jaguar, BG/L, Phoenix]
Phoenix attains the fastest per-processor performance, but scalability is poor at high P
At P=256, communication exceeds 50%; at high P the vector length decreases (fixed-size problem)
At P=512 Bassi surpasses Phoenix, outperforming the Opterons by 1.8X and BG/L by 4.5X
Jacquard and Jaguar show similar behavior, even though they have vastly different interconnects
Notice all platforms achieve < 5% of peak at high concurrencies!
Indirect addressing, global all-to-all communication, extensive (non-flop) data movement
BG/L achieves just 1% of peak at high concurrencies
Higher-scalability experiments were not possible due to the limited number of domains
To reach petascale, BB3D requires extensive code reengineering: additional decomposition schemes to reduce the communication bottleneck
Materials Science: PARATEC
PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set
Density Functional Theory to calc structure & electronic properties of new materials using QM principles
DFT calculations are one of the largest consumers of supercomputer cycles in the world
Runtime is roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of the calculation is in real space, the other in Fourier space
Uses a specialized 3D FFT to transform the wavefunctions (built from 1D FFTs, sketched below)
Quantum dots have important photo-luminescent properties
[Image: conduction band minimum electron state for a CdSe quantum dot]
Developed by Andrew Canning with Louie and Cohen’s groups (UCB, LBNL)
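The reliance on tuned 1D FFTs follows from the fact that a 3D FFT factors into 1D FFTs applied along each axis in turn, as the small Python check below illustrates; when the grid is distributed, the axis changes become the transposes (all-to-all communication) of the specialized parallel 3D FFT.

```python
import numpy as np

def fft3d_by_axes(psi):
    """3D FFT built from 1D FFTs along each axis in turn."""
    out = np.fft.fft(psi, axis=0)   # 1D FFTs along x
    out = np.fft.fft(out, axis=1)   # 1D FFTs along y (a transpose/all-to-all in the distributed code)
    out = np.fft.fft(out, axis=2)   # 1D FFTs along z
    return out

psi = np.random.rand(32, 32, 32).astype(complex)
assert np.allclose(fft3d_by_axes(psi), np.fft.fftn(psi))
```

Because the heavy lifting lands in vendor-tuned 1D FFT and BLAS3 libraries, PARATEC sustains an unusually high fraction of peak, as the next slide shows.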
PARATEC Performance: Strong Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (64-2048) for Bassi, Jacquard, Jaguar, BG/L, Phoenix]
PARATEC attains a high % of peak due to 1D FFTs and BLAS3
Bassi achieves the highest performance and good scaling up to 512 processors
1024-processor Power5 performance is from the LLNL Purple system
Jaguar outperforms Jacquard, due partially to higher interconnect bandwidth
BG/L has the lowest overall raw performance - and uses a smaller system size due to memory constraints
Performance drops between 512 and 1024 processors, due to moving off the topology half-plane
Phoenix shows the lowest % of peak: low scalar/vector ratio, and vector length drops for the custom F90 routines and BLAS3/FFT
PARATEC scaling decreases and is limited to ~2K processors, due to the single-level decomposition
For petascale, a second level of decomposition over electronic band indices must be introduced
This will increase scaling and reduce memory requirements for systems like BG/L
Gas Dynamics: HyperCLaw
Adaptive Mesh Refinement (AMR): a powerful technique to solve otherwise intractable problems, typically applied to physical systems governed by PDEs
AMR dynamically refines the underlying grid in regions of scientific interest
Naively increasing grid resolution is prohibitive in computation and memory
The price paid for the additional power is complexity: significant software infrastructure
Regridding, interpolation between coarse/fine grids, dynamic load balancing
Generally written in a flexible/modular fashion, causing performance reduction even for unrefined grid calculations
Complex communication looks more like many-to-many than a stencil exchange
Berger-Colella hyperbolic conservation laws; C++/Fortran hybrid programming model
Godunov - advances the solution at a given refinement level
TimeStep - prepares grids for the next level of the Godunov solver
Regrid - replaces the existing grid hierarchy to maintain numerical accuracy
Knapsack - load balancing algorithm
Developed by CCSE LBNL, ported/optimized by Michael Lijewski and Michael Welcome
X1E Optimization
[Bar chart: time in seconds for Godunov, TimeStep-MPI, TimeStep-Comp, Knapsack, Regrid, and Other phases on Altix, Power5, and X1E at P=32, P=64, P=128]
Two X1E optimizations were undertaken since our original study in CF06
The Knapsack and Regridding phases originally prevented X1E scalability
Knapsack was optimized by swapping pointers instead of lists, resulting in an almost cost-free X1E Knapsack algorithm (the greedy assignment itself is sketched below)
Regridding originally required an O(N^2) box-list intersection calculation
An updated hashing scheme reduced this overhead to O(N log N), significantly reducing X1E regridding overhead
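For reference, a knapsack-style load balancer of this kind can be as simple as the greedy heap-based assignment sketched below in Python: boxes are assigned in decreasing work order to the currently least-loaded processor. This is a generic illustration with hypothetical work estimates, not necessarily the exact CCSE algorithm or its pointer-swapping optimization.

```python
import heapq

def knapsack_balance(box_work, n_procs):
    """Greedy load balancing: largest boxes first, each to the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]          # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for box, work in sorted(box_work.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)                  # least-loaded processor so far
        assignment[box] = p
        heapq.heappush(heap, (load + work, p))
    return assignment

boxes = {f"box{i}": float((7 * i) % 13 + 1) for i in range(20)}   # hypothetical work estimates
print(knapsack_balance(boxes, n_procs=4))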
HyperCLaw Performance: Weak Scaling
[Charts: Gflops/Processor and Percent of Peak vs. Processors (16-1024) for Bassi, Jacquard, Jaguar, BG/L, Phoenix]
Due to code complexity, all platforms achieve < 5% of peak
Increasing the adaptivity level introduces grid management overheads, reducing performance
Bassi shows the highest raw performance, BG/L the lowest
Jacquard and Jaguar show similar raw performance
X1E shows the lowest fraction of peak (~1%); before optimization this was much lower
Hierarchical data structure management is scalar work
Higher concurrency reduces vector length
AMR is designed for small grid patches, which causes short vector lengths
Adding more complexity (elliptic solves, chemistry) would certainly reduce vector performance
May be possible to design a parallel vector AMR code, but only from the ground up
Despite low efficiency, these systems show good scalability - potential for petascale systems
Performance Summary
[Charts: Percent of Peak and relative performance for Bassi Power5, Jacquard Opteron, Jaguar Opteron, BG/L PPC440, and Phoenix X1E (MSP), across HCLaw (P=128), BB3D (P=512), Cactus (P=256), GTC (P=512), ELB3D (P=512), PARATEC (P=512), and their average; bars exceeding the Percent-of-Peak axis annotated 67%, 45%, 43%, 55%]
Evaluating HPC systems on realistic problems is a complex task, requiring a diverse group of domain scientists
This work presents one of the most extensive performance analyses to date
Overall, the Power5-based Bassi achieves the highest raw and sustained performance
High memory BW, tight interconnect integration, latency hiding via advanced prefetching
Phoenix X1E achieved high performance for GTC and ELB3D, but showed poor results for several codes
The Opteron systems show similar performance; however, Jaguar's XT3 interconnect outperformed Jacquard for GTC & PARATEC
BG/L has the lowest raw/sustained performance, but achieved unprecedented concurrency on several codes
Potential to run at petascale:
GTC, Cactus - very encouraging, linear scaling to 32K processors
ELB3D, HyperCLaw - promising scaling at lower concurrencies
BB3D and PARATEC - require reengineering to incorporate additional levels of parallelism
Comparing Jaguar Single vs. Dual Core
[Bar chart: percent slowdown/speedup (-30% to +20%) for ELBM3D, BB3D, Paratec, Hclaw, GTC, Cactus, and the average; one series compares dual-core performance vs. using one of two 2.6 GHz Jaguar cores, the other vs. an older single-core 2.4 GHz system]
Preliminary performance comparison shows that, for the most part, the dual-core architecture does not significantly inhibit performance compared to a single core
This is not the case for BB3D, where extensive communication and memory traffic place a high strain on a socket's shared resources
Understanding and optimizing multi-core systems is one of the greatest challenges in HPC and industrial computing today
Future and Related Work
Future evaluation work:
Explore more complex methods and irregular data structures
Continue evaluating leading HPC systems as they come online
Perform in-depth application characterizations
Related research activities:
Developing an application-derived I/O benchmark based on CMB
Investigating automatic tuning for stencil computations
Investigating scientific kernel performance on multi-core systems
Investigating the potential of dynamically reconfigurable interconnects to achieve fat-tree performance with a fraction of the switch components
Papers at http://crd.lbl.gov/~oliker