BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kale
University of Illinois at Urbana-Champaign
Motivations
Extremely large parallel machines are around the corner. Examples:
• ASCI Purple (12K processors, 100 TF)
• BlueGene/L (64K processors, 360 TF)
• BlueGene/C (8M processors, 1 PF)
• PetaFLOPS machines are likely to have 100K+ processors (1M?)
Will existing parallel applications scale?
• The machines are not there yet
• Parallel performance is hard to model without actually running the program
BlueGene/L
Roadmap
Explore suitable programming models:
• Charm++ (message-driven)
• MPI and its extension AMPI (an adaptive version of MPI)
Use a parallel emulator to run applications.
Build a coarse-grained simulator for performance prediction (not hardware simulation).
Charm++: Object-Based Programming Model
Processor virtualization:
• Divide the computation into a large number of pieces
• Independent of the number of processors; typically larger than the number of processors
• Let the system map objects to processors
• Empowers an adaptive, intelligent runtime system
[Figure: user view of many objects vs. the system's mapping of objects onto processors]
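As an illustration, a minimal Charm++ sketch of processor virtualization: a main chare creates many more work objects than physical processors and lets the runtime place them. The module name hello, the chare array Piece, and the 16x virtualization ratio are illustrative, not from the talk.

  // Interface file (hello.ci), illustrative:
  //   mainmodule hello {
  //     mainchare Main { entry Main(CkArgMsg *m); };
  //     array [1D] Piece { entry Piece(); entry void doWork(); };
  //   };
  #include "hello.decl.h"

  class Main : public CBase_Main {
  public:
    Main(CkArgMsg *m) {
      delete m;
      // Create far more pieces than processors; the runtime maps
      // (and can migrate) them.
      int numPieces = 16 * CkNumPes();
      CProxy_Piece pieces = CProxy_Piece::ckNew(numPieces);
      pieces.doWork();            // broadcast to every piece
    }
  };

  class Piece : public CBase_Piece {
  public:
    Piece() {}
    void doWork() {
      CkPrintf("piece %d running on processor %d\n", thisIndex, CkMyPe());
      // a real program would reduce back to Main and call CkExit()
    }
  };

  #include "hello.def.h"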
AMPI: MPI + Processor Virtualization
MPI "processes" are implemented as virtual processors (user-level migratable threads), many of which share each real processor.
[Figure: 7 MPI "processes" mapped onto a smaller number of real processors]
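Under AMPI an ordinary MPI program runs unchanged; each rank becomes a user-level thread, so MPI_Comm_size reports the number of virtual, not physical, processors. A minimal sketch (the migration hook mentioned in the comment is AMPI's extension, called MPI_Migrate in papers of this era and AMPI_Migrate in current releases):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // size = number of VIRTUAL processors
    printf("virtual rank %d of %d\n", rank, size);
    // ... compute; the AMPI runtime may migrate this rank between
    // real processors at its migration calls (MPI_Migrate/AMPI_Migrate)
    MPI_Finalize();
    return 0;
  }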
Parallel Emulator
Actually run a parallel program:
• Emulate the full machine on existing parallel machines
• Based on a common low-level abstraction (API): many multiprocessor nodes connected via message passing
• The emulator supports Charm++/AMPI
Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002.
Emulation on a Parallel Machine
[Figure: each simulating (host) processor hosts several simulated multi-processor nodes, each containing simulated processors]
Emulating 8M threads on 96 ASCI-Red processors
Emulator to Simulator
Predicting parallel performance: modeling it accurately is challenging because of
• the communication subsystem,
• the behavior of the runtime system, and
• the sheer size of the machine.
Performance Prediction
Parallel Discrete Event Simulation (PDES):
• Each logical processor (LP) has a virtual clock
• Events are time-stamped
• The state of an LP changes when an event arrives at it
Our emulator was extended to carry out PDES; a minimal LP sketch follows.
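A minimal sketch of one logical processor under these rules, assuming a bare-bones Event type; this is illustrative, not BigSim's actual data structures:

  #include <queue>
  #include <vector>
  #include <functional>
  #include <algorithm>
  #include <cstdio>

  struct Event {
    double timestamp;   // virtual time at which the event takes effect
    int    payload;     // application-specific data
    bool operator>(const Event &o) const { return timestamp > o.timestamp; }
  };

  class LogicalProcessor {
    double virtualClock = 0.0;    // this LP's virtual clock
    std::priority_queue<Event, std::vector<Event>,
                        std::greater<Event>> pending;  // earliest event first
  public:
    void receive(const Event &e) { pending.push(e); }

    // Process the earliest pending event: advance the virtual clock
    // and change the LP's state (here, just a printout).
    void step() {
      if (pending.empty()) return;
      Event e = pending.top();
      pending.pop();
      virtualClock = std::max(virtualClock, e.timestamp);
      std::printf("state change at virtual time %.3f\n", virtualClock);
    }
  };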
Predict Parallel Components
How to predict each component? At multiple resolution levels:
• Sequential component:
  • User-supplied expressions
  • Performance counters
  • Instruction-level simulation
• Parallel component:
  • Simple latency-based network model
  • Contention-based network simulation
Prior PDES Work
Conservative vs. optimistic protocols:
• Conservative (examples: DaSSF, MPI-SIM):
  • Ensure the safety of processing events in a global fashion
  • Typically require look-ahead, with high global synchronization overhead
• Optimistic (examples: Time Warp, SPEEDES):
  • Each LP processes its earliest event on its own, undoing earlier out-of-order execution when causality errors occur
  • Exploits the parallelism of the simulation better, and is preferred
Why not use existing PDES?
Major synchronization overheads:
• Rollback/restart overhead
• Checkpointing overhead
We can do better when simulating some parallel applications:
• Inherent determinacy: most parallel programs are written to be deterministic (e.g., Jacobi)
Timestamp Correction
Messages should be executed in the order of their timestamps; out-of-order message delivery causes causality errors.
Traditional methods require rollback and checkpointing.
The determinacy inherent in applications lets us avoid both, if we can capture event dependencies:
• Run-time detection
• Using the "structured dagger" language to express dependencies
Simulation of Different Applications
Linear-order applications:
• No wildcard MPI receives
• Strong determinacy; no timestamp correction necessary
Reactive applications (atomic):
• Message-driven objects; methods execute as the corresponding messages arrive
Multi-dependent applications (see the MPI sketch below):
• Irecvs with Waitall (MPI)
• Uses of structured dagger to capture dependencies (Charm++)
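A minimal sketch of the multi-dependent pattern in MPI: a halo exchange whose next step depends on several messages, posted with MPI_Irecv and completed by MPI_Waitall. Neighbor ranks, tags, and buffer sizes are illustrative:

  #include <mpi.h>

  // One halo-exchange step: execution may proceed only after all four
  // requests complete, so the simulator sees an event that depends on
  // multiple messages.
  void exchange(double *recvL, double *recvR, double *sendL, double *sendR,
                int n, int leftRank, int rightRank) {
    MPI_Request reqs[4];
    MPI_Irecv(recvL, n, MPI_DOUBLE, leftRank,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvR, n, MPI_DOUBLE, rightRank, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(sendR, n, MPI_DOUBLE, rightRank, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(sendL, n, MPI_DOUBLE, leftRank,  1, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  }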
Structured Dagger
entry void jacobiLifeCycle() {
  for (i = 0; i < MAX_ITER; i++) {
    atomic { sendStripToLeftAndRight(); }
    overlap {
      when getStripFromLeft(Msg *leftMsg) {
        atomic { copyStripFromLeft(leftMsg); }
      }
      when getStripFromRight(Msg *rightMsg) {
        atomic { copyStripFromRight(rightMsg); }
      }
    }
    atomic { doWork(); /* Jacobi Relaxation */ }
  }
}
Time Stamping Messages
Each LP has a virtual timer: curT
Message sent: RecvT(msg) = curT + Latency
Message scheduled: curT = max(curT, RecvT(msg))
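A minimal sketch of these two rules, under a latency-only network model; the Message and LP types are illustrative:

  #include <algorithm>

  struct Message { double recvT; /* ... payload ... */ };

  struct LP {
    double curT = 0.0;   // the LP's virtual timer

    // Message sent: stamp it with its predicted receive time.
    Message send(double latency) const {
      Message m;
      m.recvT = curT + latency;
      return m;
    }

    // Message scheduled: the timer never runs backwards, so it jumps
    // forward to the message's receive time if necessary.
    void schedule(const Message &m) {
      curT = std::max(curT, m.recvT);
      // ... run the handler, then advance curT by its predicted cost
    }
  };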
Timestamp Correction
[Figure: two execution timelines of messages M1 through M8 annotated with receive times; after a message arrives out of order, a correction message shifts the timestamps of the affected events]
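Because execution is deterministic, a late-arriving timestamp change does not force re-execution: already-executed events keep their results, only their timestamps are recomputed, and corrections are propagated for the messages they sent. A minimal sketch, assuming the per-LP timeline is kept ordered by corrected receive time; all names are illustrative:

  #include <vector>
  #include <algorithm>

  struct ExecutedEvent {
    double recvT;                  // (possibly corrected) receive time
    double duration;               // predicted execution cost
    double startT, endT;           // place on the execution timeline
    std::vector<int> sentMsgIds;   // messages this event sent
  };

  // Recompute one LP's timeline after some event's recvT changed.
  // Returns the ids of messages whose timestamps moved, so correction
  // messages can be sent to the LPs that received them. No event is
  // re-executed: determinacy guarantees the results are unchanged.
  std::vector<int> correctTimeline(std::vector<ExecutedEvent> &timeline) {
    std::vector<int> corrections;
    double clock = 0.0;
    for (ExecutedEvent &ev : timeline) {
      double newStart = std::max(clock, ev.recvT);
      if (newStart != ev.startT) {           // this event's timestamp moved
        ev.startT = newStart;
        ev.endT   = newStart + ev.duration;
        corrections.insert(corrections.end(),
                           ev.sentMsgIds.begin(), ev.sentMsgIds.end());
      }
      clock = ev.endT;
    }
    return corrections;
  }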
Architecture of BigSim Simulator
[Diagram: Charm++ and MPI applications run on the BigSim Emulator, built on the Charm++ Runtime with an online PDES engine and a Load Balancing Module; sequential times come from performance counters or instruction-level simulators (RSim, IBM, ...), communication times from a simple network model; simulation output trace logs feed performance visualization (Projections)]
Architecture of BigSim Simulator (with offline network simulation)
[Diagram: the same stack as above, with the emulator's trace logs additionally fed to BigNetSim (built on POSE), a detailed network simulator run as offline PDES, before performance visualization (Projections)]
BigSim Validation on Lemieux: Jacobi 3D (MPI)
[Plot: actual execution time vs. predicted time, in seconds (0 to 1.2), for 64, 128, 256, and 512 simulated processors, running on 32 real processors]
Jacobi on a 64K BG/L
Case Study: LeanMD
Molecular dynamics simulation designed for large machines, using k-away cut-off parallelization.
Benchmark: er-gre with 3-away cut-off
• 36,573 atoms
• 1.6 million objects
• 8-step simulation
• 32K-processor BG machine, simulated on 400 PSC Lemieux processors
Performance visualization tools
Load Imbalance
[Figure: histogram of predicted load across simulated processors]
Performance of the BigSim Simulator
[Figure: simulator performance as a function of the number of real processors (PSC Lemieux)]
Conclusions
Improved simulation efficiency by exploiting the inherent determinacy of parallel applications.
The simulation techniques explored show good parallel scalability.
http://charm.cs.uiuc.edu
Future Work
Improving simulation accuracy:
• Instruction-level simulator
• Network simulator
Developing run-time techniques (e.g., load balancing) for very large machines using the simulator