BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines
Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kale
University of Illinois at Urbana-Champaign
Motivations
Extremely large parallel machines are around the corner. Examples:
• ASCI Purple (12K processors, 100 TF)
• BlueGene/L (64K processors, 360 TF)
• BlueGene/C (8M processors, 1 PF)
• PetaFLOPS machines are likely to have 100K+ processors (1M?)
Will existing parallel applications scale?
• The machines are not there yet
• Parallel performance is hard to model without actually running the program
BlueGene/L
Roadmap
Explore suitable programming models:
• Charm++ (message-driven)
• MPI and its extension AMPI (an adaptive version of MPI)
Use a parallel emulator to run applications.
Build a coarse-grained simulator for performance prediction (not hardware simulation).
Charm++: Object-Based Programming Model
Processor virtualization:
• Divide the computation into a large number of pieces
• Independent of the number of processors; typically larger than the number of processors
• Let the system map objects to processors
• Empowers an adaptive, intelligent runtime system
[Figure: user view of many objects vs. the system's mapping of objects onto processors]
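As an illustration, a minimal Charm++ sketch of processor virtualization: a main chare creates many more work objects than physical processors and lets the runtime place them. The module name hello, the chare array Piece, and the 16x virtualization ratio are illustrative, not from the talk.

  // Interface file (hello.ci), illustrative:
  //   mainmodule hello {
  //     mainchare Main { entry Main(CkArgMsg *m); };
  //     array [1D] Piece { entry Piece(); entry void doWork(); };
  //   };
  #include "hello.decl.h"

  class Main : public CBase_Main {
  public:
    Main(CkArgMsg *m) {
      delete m;
      // Create far more pieces than processors; the runtime maps
      // (and can migrate) them.
      int numPieces = 16 * CkNumPes();
      CProxy_Piece pieces = CProxy_Piece::ckNew(numPieces);
      pieces.doWork();            // broadcast to every piece
    }
  };

  class Piece : public CBase_Piece {
  public:
    Piece() {}
    void doWork() {
      CkPrintf("piece %d running on processor %d\n", thisIndex, CkMyPe());
      // a real program would reduce back to Main and call CkExit()
    }
  };

  #include "hello.def.h"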
AMPI: MPI + Processor Virtualization
MPI "processes" are implemented as virtual processors (user-level migratable threads), many of which share each real processor.
[Figure: 7 MPI "processes" mapped onto a smaller number of real processors]
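Under AMPI an ordinary MPI program runs unchanged; each rank becomes a user-level thread, so MPI_Comm_size reports the number of virtual, not physical, processors. A minimal sketch (the migration hook mentioned in the comment is AMPI's extension, called MPI_Migrate in papers of this era and AMPI_Migrate in current releases):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // size = number of VIRTUAL processors
    printf("virtual rank %d of %d\n", rank, size);
    // ... compute; the AMPI runtime may migrate this rank between
    // real processors at its migration calls (MPI_Migrate/AMPI_Migrate)
    MPI_Finalize();
    return 0;
  }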
Parallel Emulator
Actually run a parallel program:
• Emulate the full machine on existing parallel machines
• Based on a common low-level abstraction (API): many multiprocessor nodes connected via message passing
• The emulator supports Charm++/AMPI
Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, "A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops", NGS Program Workshop, IPDPS 2002.
Emulation on a Parallel Machine
[Figure: each simulating (host) processor hosts several simulated multi-processor nodes, each containing simulated processors]
Emulating 8M threads on 96 ASCI-Red processors
Emulator to Simulator
Predicting parallel performance: modeling it accurately is challenging because of
• the communication subsystem,
• the behavior of the runtime system, and
• the sheer size of the machine.
Performance Prediction
Parallel Discrete Event Simulation (PDES):
• Each logical processor (LP) has a virtual clock
• Events are time-stamped
• The state of an LP changes when an event arrives at it
Our emulator was extended to carry out PDES; a minimal LP sketch follows.
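A minimal sketch of one logical processor under these rules, assuming a bare-bones Event type; this is illustrative, not BigSim's actual data structures:

  #include <queue>
  #include <vector>
  #include <functional>
  #include <algorithm>
  #include <cstdio>

  struct Event {
    double timestamp;   // virtual time at which the event takes effect
    int    payload;     // application-specific data
    bool operator>(const Event &o) const { return timestamp > o.timestamp; }
  };

  class LogicalProcessor {
    double virtualClock = 0.0;    // this LP's virtual clock
    std::priority_queue<Event, std::vector<Event>,
                        std::greater<Event>> pending;  // earliest event first
  public:
    void receive(const Event &e) { pending.push(e); }

    // Process the earliest pending event: advance the virtual clock
    // and change the LP's state (here, just a printout).
    void step() {
      if (pending.empty()) return;
      Event e = pending.top();
      pending.pop();
      virtualClock = std::max(virtualClock, e.timestamp);
      std::printf("state change at virtual time %.3f\n", virtualClock);
    }
  };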
Predict Parallel Components
How to predict each component? At multiple resolution levels:
• Sequential component:
  • User-supplied expressions
  • Performance counters
  • Instruction-level simulation
• Parallel component:
  • Simple latency-based network model
  • Contention-based network simulation
Prior PDES Work
Conservative vs. optimistic protocols:
• Conservative (examples: DaSSF, MPI-SIM):
  • Ensure the safety of processing events in a global fashion
  • Typically require look-ahead, with high global synchronization overhead
• Optimistic (examples: Time Warp, SPEEDES):
  • Each LP processes its earliest event on its own, undoing earlier out-of-order execution when causality errors occur
  • Exploits the parallelism of the simulation better, and is preferred
Why not use existing PDES?
Major synchronization overheads:
• Rollback/restart overhead
• Checkpointing overhead
We can do better when simulating some parallel applications:
• Inherent determinacy: most parallel programs are written to be deterministic (e.g., Jacobi)
Timestamp Correction
Messages should be executed in the order of their timestamps; out-of-order message delivery causes causality errors.
Traditional methods require rollback and checkpointing.
The determinacy inherent in applications lets us avoid both, if we can capture event dependencies:
• Run-time detection
• Using the "structured dagger" language to express dependencies
Simulation of Different Applications
Linear-order applications:
• No wildcard MPI receives
• Strong determinacy; no timestamp correction necessary
Reactive applications (atomic):
• Message-driven objects; methods execute as the corresponding messages arrive
Multi-dependent applications (see the MPI sketch below):
• Irecvs with Waitall (MPI)
• Uses of structured dagger to capture dependencies (Charm++)
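A minimal sketch of the multi-dependent pattern in MPI: a halo exchange whose next step depends on several messages, posted with MPI_Irecv and completed by MPI_Waitall. Neighbor ranks, tags, and buffer sizes are illustrative:

  #include <mpi.h>

  // One halo-exchange step: execution may proceed only after all four
  // requests complete, so the simulator sees an event that depends on
  // multiple messages.
  void exchange(double *recvL, double *recvR, double *sendL, double *sendR,
                int n, int leftRank, int rightRank) {
    MPI_Request reqs[4];
    MPI_Irecv(recvL, n, MPI_DOUBLE, leftRank,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvR, n, MPI_DOUBLE, rightRank, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(sendR, n, MPI_DOUBLE, rightRank, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(sendL, n, MPI_DOUBLE, leftRank,  1, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  }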
Structured Dagger
entry void jacobiLifeCycle() {
  for (i = 0; i < MAX_ITER; i++) {
    atomic { sendStripToLeftAndRight(); }
    overlap {
      when getStripFromLeft(Msg *leftMsg) {
        atomic { copyStripFromLeft(leftMsg); }
      }
      when getStripFromRight(Msg *rightMsg) {
        atomic { copyStripFromRight(rightMsg); }
      }
    }
    atomic { doWork(); /* Jacobi Relaxation */ }
  }
}
Time Stamping Messages
Each LP has a virtual timer: curT
Message sent: RecvT(msg) = curT + Latency
Message scheduled: curT = max(curT, RecvT(msg))
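A minimal sketch of these two rules, under a latency-only network model; the Message and LP types are illustrative:

  #include <algorithm>

  struct Message { double recvT; /* ... payload ... */ };

  struct LP {
    double curT = 0.0;   // the LP's virtual timer

    // Message sent: stamp it with its predicted receive time.
    Message send(double latency) const {
      Message m;
      m.recvT = curT + latency;
      return m;
    }

    // Message scheduled: the timer never runs backwards, so it jumps
    // forward to the message's receive time if necessary.
    void schedule(const Message &m) {
      curT = std::max(curT, m.recvT);
      // ... run the handler, then advance curT by its predicted cost
    }
  };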
Timestamp Correction
[Figure: two execution timelines of messages M1 through M8 annotated with receive times; after a message arrives out of order, a correction message shifts the timestamps of the affected events]
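Because execution is deterministic, a late-arriving timestamp change does not force re-execution: already-executed events keep their results, only their timestamps are recomputed, and corrections are propagated for the messages they sent. A minimal sketch, assuming the per-LP timeline is kept ordered by corrected receive time; all names are illustrative:

  #include <vector>
  #include <algorithm>

  struct ExecutedEvent {
    double recvT;                  // (possibly corrected) receive time
    double duration;               // predicted execution cost
    double startT, endT;           // place on the execution timeline
    std::vector<int> sentMsgIds;   // messages this event sent
  };

  // Recompute one LP's timeline after some event's recvT changed.
  // Returns the ids of messages whose timestamps moved, so correction
  // messages can be sent to the LPs that received them. No event is
  // re-executed: determinacy guarantees the results are unchanged.
  std::vector<int> correctTimeline(std::vector<ExecutedEvent> &timeline) {
    std::vector<int> corrections;
    double clock = 0.0;
    for (ExecutedEvent &ev : timeline) {
      double newStart = std::max(clock, ev.recvT);
      if (newStart != ev.startT) {           // this event's timestamp moved
        ev.startT = newStart;
        ev.endT   = newStart + ev.duration;
        corrections.insert(corrections.end(),
                           ev.sentMsgIds.begin(), ev.sentMsgIds.end());
      }
      clock = ev.endT;
    }
    return corrections;
  }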
Architecture of BigSim Simulator
[Diagram: Charm++ and MPI applications run on the BigSim Emulator, built on the Charm++ Runtime with an online PDES engine and a Load Balancing Module; sequential times come from performance counters or instruction-level simulators (RSim, IBM, ...), communication times from a simple network model; simulation output trace logs feed performance visualization (Projections)]
Architecture of BigSim Simulator (with offline network simulation)
[Diagram: the same stack as above, with the emulator's trace logs additionally fed to BigNetSim (built on POSE), a detailed network simulator run as offline PDES, before performance visualization (Projections)]
BigSim Validation on Lemieux: Jacobi 3D (MPI)
[Plot: actual execution time vs. predicted time, in seconds (0 to 1.2), for 64, 128, 256, and 512 simulated processors, running on 32 real processors]
Jacobi on a 64K BG/L
Case Study: LeanMD
Molecular dynamics simulation designed for large machines, using k-away cut-off parallelization.
Benchmark: er-gre with 3-away cut-off
• 36,573 atoms
• 1.6 million objects
• 8-step simulation
• 32K-processor BG machine, simulated on 400 PSC Lemieux processors
Performance visualization tools
Load Imbalance
[Figure: histogram of predicted load across simulated processors]
Performance of the BigSim Simulator
[Figure: simulator performance as a function of the number of real processors (PSC Lemieux)]
Conclusions
Improved simulation efficiency by exploiting the inherent determinacy of parallel applications.
The simulation techniques explored show good parallel scalability.
http://charm.cs.uiuc.edu
Future Work
Improving simulation accuracy:
• Instruction-level simulator
• Network simulator
Developing run-time techniques (e.g., load balancing) for very large machines using the simulator