Charm++: Migratable Objects + Asynchronous Methods + Adaptive Runtime = Performance + Productivity

Laxmikant V. Kale∗
Anshu Arya, Nikhil Jain, Akhil Langer, Jonathan Lifflander, Harshitha Menon, Xiang Ni, Yanhua Sun, Ehsan Totoni, Ramprasad Venkataraman∗, Lukasz Wesolowski

Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
∗{kale, ramv}@illinois.edu
SC12: November 13, 2012
Kale et al. (PPL, Illinois)
Benchmarks
Required
1D FFT
Random Access
Dense LU Factorization
Optional
Molecular Dynamics
Adaptive Mesh Refinement
Sparse Triangular Solver
Metrics: Performance and Productivity
Our Implementations in Charm++

Productivity (source lines) and performance:

Code               | C++  | CI  | Subtotal | Driver | Total | Machine  | Max Cores | Performance Highlight
1D FFT             |   54 |  29 |       83 |    102 |   185 | IBM BG/P | 64K       | 2.71 TFlop/s
                   |      |     |          |        |       | IBM BG/Q | 16K       | 2.31 TFlop/s
Random Access      |   76 |  15 |       91 |     47 |   138 | IBM BG/P | 128K      | 43.10 GUPS
                   |      |     |          |        |       | IBM BG/Q | 16K       | 15.00 GUPS
Dense LU           | 1001 | 316 |     1317 |    453 |  1770 | Cray XT5 | 8K        | 55.1 TFlop/s (65.7% of peak)
Molecular Dynamics |  571 | 122 |      693 |    n/a |   693 | IBM BG/P | 128K      | 24 ms/step (2.8M atoms)
                   |      |     |          |        |       | IBM BG/Q | 16K       | 44 ms/step (2.8M atoms)
Triangular Solver  |  642 |  50 |      692 |     56 |   748 | IBM BG/P | 512       | 48x speedup on 64 cores with helm2d03 matrix
AMR                | 1126 | 118 |     1244 |    n/a |  1244 | IBM BG/Q | 32K       | 22 timesteps/sec (2D mesh, max 15 refinement levels)

C++: regular C++ code. CI: parallel interface descriptions and control flow DAG.
Capabilities: Demonstrated Productivity Benefits
Automatic load balancing
Automatic checkpoints
Tolerating process failures
Asynchronous, non-blocking collective communication
Interoperating with MPI
For more info
http://charm.cs.illinois.edu/
Capabilities: Automated Dynamic Load Balancing
Measurement-based fine-grained load balancing
  - Principle of persistence: the recent past indicates the near future.
  - Charm++ provides a suite of load balancers.
How to use?
  - Periodic calls in the application: AtSync().
  - Command-line argument: +balancer Strategy.
MetaBalancer: when and how to load balance?
  - Monitors the application continuously and predicts its behavior.
  - Decides when to invoke which load balancer.
  - Command-line argument: +MetaLB.
Capabilities: Checkpointing Application State
Checkpointing to disk for split execution: CkStartCheckpoint(callback)
  - Designed for applications that need to run for a long period but cannot obtain the entire allocation at one time.
Restart applications from a checkpoint on any number of processors.
Capabilities: Tolerating Process Failures
Double in-memory checkpointing for online recovery: CkStartMemCheckpoint(callback)
  - Tolerates the increasingly frequent failures in HPC systems.
Failure injection and automatic failure detection: CkDieNow()
Capabilities: Interoperability
Invoke Charm++ from MPI
  - Callable like other external MPI libraries
  - Use MPI communicators to enable the following modes
[Diagram: processors P(1)...P(N) switching between MPI and Charm++ over time: (a) Time Sharing, (b) Space Sharing, (c) Combined]
Capabilities: Interoperability
Trivial Changes to Existing Codes
Initialize and destroy Charm++ instances
Use interface functions to transfer control
// MPI_Init and other basic initialization
{ optional pure MPI code blocks }

// create a communicator for initializing Charm++
MPI_Comm_split(MPI_COMM_WORLD, peid % 2, peid, &newComm);
CharmLibInit(newComm, argc, argv);

{ optional pure MPI code blocks }

// Charm++ library invocation
if (myrank % 2)
  fft1d(inputData, outputData, data_size);

// more pure MPI code blocks
// more Charm++ library calls

CharmLibExit();
// MPI cleanup and MPI_Finalize
Capabilities: Asynchronous, Non-blocking Collective Communication
Overlap collective communication with other work
Topological Routing and Aggregation Module (TRAM)
  - Transforms point-to-point communication into collectives
  - Minimal topology-aware software routing
  - Aggregation of fine-grained communication
  - Recombining at intermediate destinations
Intuitive expression of collectives by overloading the constructs for point-to-point sends (e.g. broadcast)
FFT: Parallel Coordination Code

doFFT()
  for (phase = 0; phase < 3; ++phase) {
    atomic {
      sendTranspose();
    }
    for (count = 0; count < P; ++count)
      when recvTranspose[phase](fftMsg *msg) atomic {
        applyTranspose(msg);
      }
    if (phase < 2) atomic {
      fftw_execute(plan);
      if (phase == 0)
        twiddle();
    }
  }
FFT: Performance
IBM Blue Gene/P (Intrepid), 25% memory, ESSL with FFTW wrappers

[Plot: GFlop/s (10^1 to 10^4, log scale) vs. cores (256 to 65536); series: P2P All-to-all, Mesh All-to-all, Serial FFT limit]

Charm++ all-to-all using TRAM: asynchronous, non-blocking, topology-aware, combining, streaming
Random Access

Productivity
  - Use point-to-point sends and let Charm++ optimize communication
  - Automatically detect and adapt to the network topology of the partition
Performance
  - Automatic communication optimization using TRAM:
    - Aggregation of fine-grained communication
    - Minimal topology-aware software routing
    - Recombining at intermediate destinations
Random Access: Performance
IBM Blue Gene/P (Intrepid), Blue Gene/Q (Vesta)

[Plot: GUPS (0.0625 to 64, log scale) vs. number of cores (128 to 128K); series: Perfect Scaling, BG/P, BG/Q; 43.10 GUPS at 128K cores on BG/P]
LU: Capabilities

Composable library
  - Modular program structure
  - Seamless execution structure (interleaved modules)
Block-centric
  - Algorithm written from a block's perspective
  - Agnostic of processor-level considerations
Separation of concerns
  - Domain specialist codes the algorithm
  - Systems specialist codes tuning, resource management, etc.

Lines of code and module-specific commits:

Module            | CI  | C++ | Total | Module-specific commits
Factorization     | 517 | 419 |   936 | 472/572 (83%)
Mem. Aware Sched. |   9 | 492 |   501 |  86/125 (69%)
Mapping           |  10 |  72 |    82 |   29/42 (69%)
LU: Capabilities

Flexible data placement
  - Experiment with data layout
Memory-constrained adaptive lookahead
LU: Performance
Weak scaling (N such that the matrix fills 75% of memory)

[Plot: total TFlop/s vs. number of cores (128 to 8192) on Cray XT5 against theoretical peak; efficiency per point: 67%, 67.4%, 67.4%, 67.1%, 66.2%, 65.7%]
LU: Performance
...and strong scaling too! (N = 96,000)

[Plot: total TFlop/s vs. number of cores; weak scaling on Cray XT5 and strong scaling on IBM BG/P against theoretical peaks; BG/P strong-scaling efficiencies: 60.3%, 45%, 40.8%, 31.6%]
Optional Benchmarks
Why MD, AMR, and Sparse Triangular Solver?
  - Relevant scientific computing kernels
  - Challenge the parallelization paradigm:
    - Load imbalances
    - Dynamic communication structure
  - Express non-trivial parallel control flow
LeanMD (SLOC: 693)
1. Mimics the short-range force calculation in NAMD
2. Resembles miniMD of the Mantevo project (SLOC ≈ 3000)
3. Advanced features:
   - MetaBalancer: automated dynamic load balancing
   - Fault tolerance: restart based on in-memory checkpointing
   - Split execution: checkpoint on x cores, restart on y cores

[Figure: 1-Away decomposition vs. 2-AwayX decomposition]
Code for FT and LB

if (stepCount % ldbPeriod == 0) {
  serial { AtSync(); }
  when ResumeFromSync() { }
}

if (stepCount % checkptFreq == 0) {
  serial {
    // coordinate to start checkpointing
    contribute(CkCallback(CkReductionTarget(Cell, startCheckpoint),
                          thisProxy(0,0,0)));
  }
  if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {
    when startCheckpoint() serial {
      CkCallback cb(CkReductionTarget(Cell, recvCheckPointDone), thisProxy);
      if (checkptStrategy == 0) CkStartCheckpoint(logs.c_str(), cb);
      else CkStartMemCheckpoint(cb);
    }
  }
  when recvCheckPointDone() { }
}

// kill one of the processes to demonstrate fault tolerance
if (stepCount == 30 && thisIndex.x == 1 && thisIndex.y == 1 && thisIndex.z == 0) serial {
  if (CkHasCheckpoints()) {
    CkDieNow();
  }
}
MD: Performance
1 million atoms, IBM Blue Gene/P (Intrepid)

[Plot: time per step (ms, 10 to 10000, log scale) vs. number of cores (2k to 128k) on Intrepid (2.8 million atoms); series: No LB, Hybrid LB]
Fault Tolerance Support for MD: Checkpoint Time

[Plot: LeanMD checkpoint time (ms, 0 to 60) vs. number of processes (2048 to 32768) on Blue Gene/Q; series: 2.8 million atoms, 1.6 million atoms]
Fault Tolerance Support for MD: Restart Time

[Plot: LeanMD restart time (ms, 0 to 160) vs. number of processes (2048 to 32768) on Blue Gene/Q; series: 2.8 million atoms, 1.6 million atoms]
Meta-Balancer vs. Periodic Load Balancing

[Plot: elapsed time (s, 32 to 2048, log scale) vs. LB period (64 to 1024) on Blue Gene/P, for 8k to 128k cores]

Cores | No LB (s) | Periodic LB (s) | Meta-Balancer (s)
8k    |       666 |             504 |               413
16k   |       336 |             260 |               277
32k   |       171 |             131 |               131
64k   |       122 |             104 |               100
128k  |        73 |              54 |                52

  - Frequent load balancing increases total execution time.
  - Infrequent load balancing leaves load imbalance and yields no gains.
  - Meta-Balancer adaptively performs load balancing to obtain the best total execution time.
Sparse Triangular Solver: Matrix Decomposition

[Figure: column decomposition vs. dense-parts decomposition of the matrix]

Sparse Triangular Solver: Parallel Algorithm

[Figure: illustration of the parallel algorithm]
Sparse Triangular Solver: Performance vs. SuperLU_DIST

[Plot: solution time (s, 0.001 to 10, log scale) vs. number of cores (1 to 512); series: SuperLU_DIST and the Charm++ solver (slu_*) on the webbase-1M, helm2d03, largebasis, and hood matrices]
Sparse Triangular Solver: Productivity in Charm++
  - A more complicated (higher-performance) algorithm in 692 total SLOC
    - vs. 897 SLOC for the SuperLU_DIST triangular solver
  - Overdecomposition (with round-robin mapping) is essential
    - Communication/computation overlap, load balance
  - Dynamic creation of parallel units
    - Distributing dense regions
  - Message-driven execution and priorities
    - No need for anything like MPI_Iprobe
Adaptive Mesh Refinement

[Figure: sample simulation]

[Figure: propagation of refinement decision messages among blocks at depths d-1, d, d+1, d+2]

[Figure: finite state machine for each block's decision update; decisions Coarsen (d-1), Stay (d), Refine (d+1); transitions are driven by the local error condition, received messages (including from siblings), the block's required depth d, and termination detection]
Adaptive Mesh Refinement: Charm++ Implementation
  - Blocks as virtual processors, instead of each process containing many blocks
    - Simplifies implementation
  - Blocks addressed with bit-vector indices
    - The Charm++ RTS handles physical locations
  - Dynamic distributed load balancing

Algorithmic improvements [1]:
  - O(#blocks/P) vs. O(#blocks) memory per process
  - 2 system quiescence states vs. #levels reductions for mesh restructuring
  - O(1) vs. O(log P) time neighbor lookup
[1] Langer et al. Scalable Algorithms for Distributed Memory Adaptive Mesh Refinement. In 24th International Conference on Computer Architecture and High Performance Computing (SBAC-PAD) 2012, NY, USA.
Adaptive Mesh Refinement

[Plot: timesteps per second vs. number of ranks (256 to 32k); strong scaling on IBM BG/Q with a max depth of 15; series: min-depth 4, min-depth 5]

[Plot: remeshing latency (ms, 0 to 35) vs. number of ranks (16 to 32k) on IBM BG/Q for depth ranges 4-9, 4-10, and 4-11; the candlesticks show the minimum and maximum, the 5th and 95th percentile, and the median of the non-overlapped remeshing delay]
Charm++ at SC12
Temperature-aware load balancing Tue @ 2:00 pm
NAMD at 200K+ cores Thu @ 11:00 am
For more info
http://charm.cs.illinois.edu/
Charm++ Programming Model
  - Object-based: express logic via indexed collections of interacting objects (both data and tasks)
  - Over-decomposed: expose more parallelism than available processors
Charm++ Programming Model
  - Runtime-assisted: scheduling, observation-based adaptivity, load balancing, composition, etc.
  - Message-driven: trigger computation by invoking remote entry methods
  - Non-blocking, asynchronous: implicitly overlapped data transfer
Charm++ Program Structure
  - Regular C++ code
    - No special compilers
  - Small parallel interface description file
    - Can contain a control flow DAG
    - Parsed to generate more C++ code
  - Inherit from framework classes to
    - Communicate with remote objects
    - Serialize objects for transmission
  - Exploit modern C++ program design techniques (OO, generics, etc.)
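For illustration, a minimal hypothetical interface description of the kind such a file contains (module and entry names are invented; the translator parses this fragment and generates the corresponding C++ glue):

```
// hello.ci -- a small declarative fragment, parsed to generate C++ code
mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry [reductiontarget] void done();
  };
  array [1D] Worker {
    entry Worker();
    entry void work(int step);   // remote entry method
  };
};
```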
TRAM: Message Routing and Aggregation