108
1 dist-gem5: Distributed Simulation of Computer Clusters Illinois: Mohammad Alian , Prof. Nam Sung Kim ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA 1 Oct 2017

dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

  • Upload
    lyhanh

  • View
    235

  • Download
    2

Embed Size (px)

Citation preview

Page 1: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

1

dist-gem5: Distributed Simulation of

Computer Clusters

Illinois: Mohammad Alian, Prof. Nam Sung Kim

ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap

Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA

1 Oct 2017

Page 2: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

2

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 3: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

3

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 4: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

4

▪ Definition

▪ A cluster of computers that communicate and interact with each other by passing messages over

the network to process given tasks.

▪ Examples

▪ Datacenters, supercomputers

Distributed computer systems

The IBM Blue Gene/P supercomputer "Intrepid"

at Argonne National Laboratory runs 164,000 processor

cores in 40 racks/cabinets connected by a high-speed 3-

D torus network.

A Google datacenter

Page 5: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

5

▪ To maximize performance and/or energy-efficiency, we must capture the intricate

interplay amongst computers and their HW/SW sub-systems, especially due to

communications and interactions w/ each other by passing messages over the network

Exploring and optimizing distributed computer systems

Request Response

ResponseRequest

Network

Clients

Servers

0

0.5

1

1.5

2

2.5

3

3.5

0.0

0.2

0.4

0.6

0.8

1.0

0.14 0.19 0.24

Fre

qu

en

cy (

GH

z)

Uti

lizati

on

Time (s)

BW(rx)

BW(tx)

U(core)

F(core)

Page 6: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

6

Using physical computers

▪ Advantage

▪ Fast evaluations for large-scale distributed computer systems

▪ Disadvantage

▪ Limited design space exploration (unable to explore distributed computer systems based on future

processor and computer sub-systems architectures that have not been developed yet)

Using queuing-theoretic models

▪ Advantage

▪ Simple and fast evaluations for large-scale distributed computer systems

▪ Disadvantage

▪ Inaccurate/misleading evaluations (unable to capture complex interplay b/w HW/SW sub-systems of

computers)

Past methods exploring distributed computer systems [1]

Page 7: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

7

Using existing (full-system) simulators

▪ Advantage

▪ More flexible design space exploration than physical computer systems

▪ More precise evaluation than queuing-theoretic models

▪ Disadvantage

▪ gem5: limited scalability w/ slow evaluation (legacy gem5)

▪ Not flexible (SST + gem5)

▪ Proprietary and limited to x86 (COTSON)

Past methods exploring distributed computer systems [2]

Page 8: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

8

▪ Evaluating performance and power dissipation of a distributed system

▪ Complex interplay among system components at scale

▪ Demanding a full-system, cycle-level simulator which is fast enough to simulate a large-

scale computer system

▪ Enabling distributed simulation:

▪ Simulation of a distributed computer

▪ system w/ many simulation hosts

dist-gem5

scaleOS

ISAscaches

memory

network

devices

performancePower

cores

Page 9: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

9

▪ Product of excellent synergistic collaboration b/w industry and academia

▪ Integrating the best features of concurrently developed multi-gem5 from ARM and pd-gem5 from U.

of Illinois for fast and deterministic simulations of distributed computer simulations

History of dist-gem5 development

pd-gem5 multi-gem5

U. of Illinois ARM Research

dist-gem5

[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,”IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.

M. Alian, et al., pd-gem5: Simulation infrastructure for parallel/distributed computer systems. IEEE Computer Architecture Letters, vol: 15, no: 1, 2016.

Page 10: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

10

Example of research w/ dist-gem5

Datacenter power management algorithm

▪ Desired P/C-state governor

▪ react to change in core utilization in a timely manner

▪ Approaches

▪ predict changes in core utilization

▪ core utilization is highly correlated w/ network activity

▪ Hide P/C-state transition latency

▪ overlap P/C-state transition w/ packet reception

and processing

BW(rx)

U(core)

MC

DMA

Interrupt

Handler

SoftIRQ

1 2 n...

rx_desc_ring

s

k

b

Network

Stack

DRAM

s

k

b

s

k

b

Copy to

User

p

k

t

NIC

NIC

CPU

DRAMRCPCIe

Channel

[Nominated for the Best Paper Award] M. Alian, et al. “NCAP: Network-Driven, Packet

Context-Aware Power Management for client-server architecture. IEEE International

Symposium on High-Performance Computer Architecture (HPCA), February 2017.

Page 11: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

11

Other promising research directions

▪ Exploring HW/SW cross-layer approaches for datacenter computers and their sub-

systems

▪ Exploiting information from network HW/SW layers as hints for efficient management of computer

resource management (e.g., prefetching pages from slow to fast memory in hybrid memory system)

▪ Off-loading simple data-intensive operations to network interface cards (NICs)

▪ Developing efficient evaluation methodologies for large-scale distributed computer

systems

▪ Exploring systematic hybrid evaluation approaches judiciously mixing queuing-theoretic modeling

and dist-gem5-based simulation approaches for efficiently evaluating a VERY large-scale distributed

computer systems (e.g., obtaining detailed parameters for queuing-theoretic analytical model using

dist-gem5)

Page 12: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

12

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 13: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

13

Michigan m5 + Wisconsin GEMS = gem5

“The gem5 simulator is a modular platform for computer-system architecture research,

encompassing system-level architecture as well as processor microarchitecture.”

What is gem5?

Page 14: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

14

Level of detail

▪ HW Virtualization

▪ Very no/limited timing

▪ The same Host/guest ISA

▪ Functional mode

▪ No timing, chain basic blocks of instructions

▪ Can add cache models for warming

▪ Timing mode

▪ Single time for execute and memory lookup

▪ Advanced on bundle

▪ Detailed mode

▪ Full out-of-order, in-order CPU models

▪ Hit-under-miss, reodering, …

µarch Exploration

HW Validation

Perf. Validation

Cycle Accurate

1–50 KIPS

RTL simulation

High-level perf./power

Architecture exploration

Approximately Timed

0.2–3 MIPS

gem5

Loosely Timed

50–200 MIPS

Qemu

SW Dev

HW Virt.

gem5 + kvm

GIPS

Page 15: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

15

Level of detail

▪ HW Virtualization

▪ Very no/limited timing

▪ The same Host/guest ISA

▪ Functional mode

▪ No timing, chain basic blocks of instructions

▪ Can add cache models for warming

▪ Timing mode

▪ Single time for execute and memory lookup

▪ Advanced on bundle

▪ Detailed mode

▪ Full out-of-order, in-order CPU models

▪ Hit-under-miss, reodering, …

µarch Exploration

HW Validation

Perf. Validation

Cycle Accurate

1–50 KIPS

RTL simulation

High-level perf./power

Architecture exploration

Approximately Timed

0.2–3 MIPS

gem5

Loosely Timed

50–200 MIPS

Qemu

SW Dev

HW Virt.

gem5 + kvm

GIPS

Page 16: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

16

When not to use gem5

▪ Performance validation

▪ gem5 is not a cycle-accurate microarchitecture model!

▪ This typically requires more accurate models such as RTL simulation.

▪ Commercial products such as ARM CycleModels operate in this space.

▪ Core microarchitecture exploration

▪ Only do this if you have a custom, detailed, CPU model!

▪ gem5’s core models were not designed to replace more accurate microarchitectural models.

▪ To validate functional correctness or test bleeding-edge ISA improvements

▪ gem5 is not as rigorously tested as commercial products.

▪ New (ARMv8.0+) or optional instructions are sometimes not implemented

▪ Commercial products such as ARM FastModels offer better reliability in this space.

Page 17: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

17

Why gem5?

▪ Runs real workloads▪ Analyze workloads that customers use and care about

▪ … including complex workloads such as Android

▪ Comprehensive model library▪ Memory and I/O devices

▪ Full OS, Web browsers

▪ Clients and servers

▪ Rapid early prototyping▪ New ideas can be tested quickly

▪ System-level impact can be quantified

▪ System-level insights▪ Enables us to study complex

memory-system interactions

▪ Can be wired to custom models

Ubuntu (Linux 4.x) Android Nougat

Page 18: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

18

• Some timing

• Caches

• No BPs

• Fast

• Some timing

• Caches

• Limited BPs

• Fast

• Full timing

• Caches

• Branch predictors

• Slow

• No timing

• No caches

• No BP

• Really fast

CPU models overview

BaseCPU

BaseKvmCPU TraceCPUBaseSimpleCPU

AtomicSimpleCPU

TimingSimpleCPU

DerivO3CPU MinorCPU

X86KvmCPU

ArmV8KvmCPU

Page 19: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

19

Discrete event based simulation

▪ Discrete: Handles time in discrete steps

▪ Each step is a tick

▪ Usually 1THz in gem5

▪ Simulator skips to the next event on the timeline

Time

Event handler

Event handlerMyObj::startup()Schedule

Call

Page 20: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

Simulating a distributed system

Page 21: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

21

Host #1

Distributed gem5 Simulation – high level view

Host #1

simulated

system

#1

Host #2

Host #3

Packet

forwarding

▪ gem5 processes modeling full systems run in parallel

on a cluster of host machines

▪ Packet forwarding engine

▪ Forward packets among the simulated systems

▪ Synchronize the distributed simulation

▪ Simulate network topology

gem5 process

host machine

simulated

system

#2

simulated

system

#3

Page 22: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

22

Core components

Packet forwardingDistributed

checkpointing

Synchronization

Simulated network

Page 23: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

23

Core components

Packet forwardingDistributed

checkpointing

Synchronization

Simulated network

Page 24: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

24

physical host #1

physical host #3

physical host #2

physical switch

phys

NIC#1

phys

NIC#

2

phys

port1

phys

port2

phys

port3

phys

NIC#3

dist-gem5 architecture – packet forwarding

Page 25: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

25

physical host #1

physical host #3

physical host #2

physical switch

phys

NIC#1

phys

NIC#2

phys

port1

phys

port2

phys

port3

phys

NIC#3

dist-gem5 architecture – packet forwarding

gem5 #1

simulated

system #1

sim

NIC

gem5 #3

simulated switch

gem5 #2

simulated

system #2

sim

NIC

sim

port

0

sim

port

1

Page 26: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

26

physical host #1

physical host #3

physical host #2

physical switch

phys

NIC#1

phys

NIC#2

phys

port1

phys

port2

phys

port3

phys

NIC#3

gem5 #1

simulated

system #1

sim

NIC

gem5 #3

simulated switch

gem5 #2

simulated

system #2

sim

NIC

sim

port

0

sim

port

1

dist-gem5 architecture – packet forwarding

simulated packets

are embedded into

host TCP/IP

packetssim pkt

TCP sim pkt

sim pktTCP sim pkt

sim pkt

Page 27: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

27

Asynchronous processing of incoming messages

▪ Simulation thread (main thread)

▪ Process/insert events in the event queue

▪ In case of send pkt event, encapsulate the simulated

Ethernet packet in a message and send it out

▪ Receiver thread

▪ Create for each gem5 process

▪ Waits for incoming packets

▪ Creates a recv pkt event and insert it to the event

queue

eventQsimulation

threadsend pkt

recv pkt

physical host

gem5 process

receiver

thread

phys

NIC

Page 28: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

28

▪ What is the correct tick for the receive event?

▪ st : send tick

▪ lat: simulated link latency

▪ bw: simulated link bandwidth (bytes/tick)

▪ size: simulated packet size (bytes)

▪ rt: receive tick

rt = st + lat + size / bw

▪ Accurate simulation

▪ rt >= curTick() when the receiver gem5 gets the real message encapsulating the simulated packet

▪ receiver gem5 can schedule the receive event for the simulated NIC

Simulation accuracy and packet forwarding

Head

Tail

event queue

receive frame

tim

e

curTick

Page 29: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

29

Core components

Packet forwardingDistributed

checkpointing

Synchronization

Simulated network

Page 30: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

30

Need for synchronization

30

wall clock time

• Receiver gem5 can run ahead of

sender gem5

✓Physical host mismatch

✓Different events to be processed

• Slowed down receiver gem5 to

ensure simulation accuracy

• Quantum-based synchronization

gem5#0

gem5#1

send time

expected

delivery time

simulated network delay

recv time

late packet arrival

Page 31: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

31

Accurate packet forwarding

31

wall clock time

• quantum: interval for periodic synchronization in simulated time

• Sync-event flushes inter gem5 communication channels

• If quantum ≤ simulated link delay:

✓expected delivery tick falls inside the next quantum

• Optimal quantum size for accurate forwarding == simulated link delay

gem5#0

gem5#1

gem5#0

gem5#1

send time

packet arrival wall

clock time

expected

delivery time

simulated network delay

quantum

global sync

quantum

Page 32: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

32

▪ Simulation progress gets stopped at each sync

tick in each gem5 process

▪ Simulated compute node

▪ Sends out ‘synq request’ message

▪ Waits until ‘sync ack’ message comes back

▪ Simulated switch

▪ Waits until it receives a ‘sync request’ message

▪ Sends out ‘sync ack’ message

Compute nodes, switch and synchronization

compute

node

gem5

Ethernet

switch

gem5

compute

node

gem5

compute

node

gem5

Page 33: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

33

▪ A vanilla global gem5 event is scheduled at each sync tick in each gem5 process

▪ A global gem5 event is a transparent thread barrier (in case of multiple simulation threads)

▪ dist-gem5 global sync is prepared to work with multi-queue/multi-threaded gem5 simulations

The global sync event

▪ The process() method in a compute node

▪ sends out ‘sync request’ messages for each

simulated link

▪ waits on a condition variable to get notified

about completion by the receiver thread

▪ The process() method in a switch

▪ waits for completion notification from the

receiver thread

▪ sends out ‘sync ack’ messages for each

simulated link

▪ Receiver thread keeps processing incoming messages while simulation thread is blocked

▪ creates receive events in the event queue for simulated Ethernet frames

▪ notifies blocked simulation thread when ‘sync

ack’ messages arrive

▪ notifies blocked simulation thread when

‘sync request’ messages arrive

Page 34: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

34

Core components

Packet forwardingDistributed

checkpointing

Synchronization

Simulated network

Page 35: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

35

▪ Checkpoint support for dist-gem5 relies on the mainline gem5 checkpoint support

▪ Each gem5 process of a dist-gem5 run creates its own checkpoint

▪ dist-gem5 adds an extra co-ordination layer to ensure correctness

▪ No in-flight message may exist among gem5 processes when the distributed checkpoint is taken

Distributed checkpointing

m5 checkpoint pseudo inst

exitSimLoop() drain() serialize() drainResume() simulate()

m5 checkpoint pseudo inst

global syncexitSimLoop()

drain() global sync serialize()drain

Resume()simulate()

dist-gem5 checkpoint co-ordination

Page 36: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

36

Distributed checkpointing (cont.)

▪ Checkpoint can only be initiated at a periodic global sync

▪ Simplifying implementation without scarifying usability

Checkpoint flavour collaborative

checkpoint

immediate checkpoint

Condition all compute nodes signal

intent

at least one compute node

signals intent

Example use case Instrumented MPI

application source code to

take a checkpoint at the

MPI_barrier() before ROI

Taking a checkpoint from

the bootscript before

starting an MPI application

(i.e. before calling ‘mpirun’)

Page 37: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

37

Checkpoint @ global sync

▪ In practical use cases a distributed checkpoint is taken “near” an application barrier (e.g.

MPI_Barrier() or mpirun)

▪ We want to take the checkpoint when all processes hit the barrier in the application code =>

desired application state can be captured even if we allow checkpoint writes only at global sync

▪ At a global sync

▪ A compute node gem5 processes can signal its intention to take a checkpoint

▪ ‘m5 checkpoint’ pseudo instruction => ‘need checkpoint’ meta info in the next ‘sync request’ message

▪ Switch gem5 process can command to write a checkpoint

▪ ‘write checkpoint’ meta info in the ‘sync ack’ message => exitSimLoop() in all gem5 processes

Page 38: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

38

Writing a checkpoint

draining

gem5#0

gem5#1

Wall clock time

p

writing

checkpoint

p

global sync

d0

dist checkpoint starts

d1

q – d0

q : sync quantum ticks

d0, d1: drain ticks

▪ Distributed checkpoint can start

only at a global sync

▪ Draining may require different

number of ticks in each gem5

▪ After drain complete, we flush out

in-flight messages with an extra

global sync

▪ Global sync implements both an

execution and a data (message)

barrier

q – d1

Page 39: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

39

Restoring from a checkpoint

Wall clock time

global sync

q’ : sync quantum ticks

d0, d1 : drain ticks

align

ticks

restoringfrom

checkpoint

d0

d1

q’

q’

d’

draining

▪ Checkpoint might be written at

different ticks in different gem5

processes

▪ An additional global sync to align

the ticks:

d0 + d’ = d1

▪ Global sync delivers the max tick

value to all gem5 processes

▪ Periodic global sync always happens

at the same tick in every gem5

▪ Global sync period may change at

restore

▪ Same checkpoint can be used to

explore different network link

latency/bandwidth effects

Page 40: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

40

▪ User is allowed to change simulated link parameters when restoring from a checkpoint

▪ Same checkpoint can be used to explore different network link latency/bandwidth effects

▪ Global sync period may change at restore (if the simulated link latency change)

▪ Checkpoint may contain simulated packets to get received in the future

▪ Receive ticks for such packets need to be adjusted to reflect the change of the simulated link

parameters

Restoring from a checkpoint (cont.)

Page 41: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

41

Core components

Packet forwardingDistributed

checkpointing

Synchronization

Simulated network

Page 42: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

42

server #2

dist-gem5 architecture – network modeling

42

Server #1

server #3

server #4

server #5

server #6

server #7

Server #0

top of rack

switch #0

server #10

server #9

server #11

server #12

server #13

server #14

server #15

server #8

top of rack

switch #1

server #58

server #57

server #59

server #60

server #61

server #62

server #63

server #56

top of rack

switch #7

aggregate

switch

. . .

simulate in one gem5 process

Page 43: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

43

▪ Configurable baseline Ethernet switch model

▪ Port number, delay, bandwidth, buffer size

physical host

Configurable network model

43

top of rack

switch #0

top of rack

switch #1top of rack

switch #7

aggregate

switch

p8

p0 p7 p0 p7

p8

. . .. . . . . .

p0 p7p1

p8

gem5

simulated etherLinksimulated port

distEtherLink

simulated etherSwitch

p0 p7

MAC

Table In-orderQ#0

In-orderQ#n

IPORT#0

IPORT#n

OPORT#0

OPORT#n

. . .

. . .

Page 44: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

Deterministic simulation

Page 45: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

45

▪ We assume that a single compute node gem5 simulation is deterministic

▪ Ordering and speed of dist-gem5 messages in real world

▪ Speed of gem5 processes (relative to each other) may vary

▪ Communication speed among gem5 process may vary

▪ Global sync guarantees deterministic packet forwarding

▪ sync quantum <= simulated link latency

▪ global sync is a message barrier

Deterministic execution issues

Page 46: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

46

: message

delivery in

wall clock

time

Global sync and deterministic packet forwarding

gem5#0

gem5#1

gem5#0

gem5#1

qsend tick#1

receive tick#1n

q

global sync

q : global sync

period in ticks

(quantum)

n: simulated

link latency in

ticks

gem5#2 gem5#2

n receive

tick#2

send tick#2

wall clock time

▪ Receive tick for a simulated

packet may not fall within the

same quantum which the

message gets received in

▪ A message is always gets sent

and received within a single

quantum

Page 47: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

Validation and speedup

[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,”

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.

Page 48: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

49

Methodology – simulation techniques

▪ For example, simulating a cluster w/ 7 nodes and 1 network switch:

quad core physical host

gem5#6

system#6

gem5#7

switch

gem5#4

system#4

gem5#2

system#2

gem5#0

system#0

gem5#5

system#5

gem5#3

system#3

gem5#1

system#1

quad core physical host

gem5#6

system#6

gem5#7

switch

gem5#4

system#4

gem5#5

system#5

quad core physical host

gem5#0

system#6 switch

system#4

system#2

system#0

system#5

system#3

system#1

quad core physical host

gem5#6

system#2

gem5#7

system#3

gem5#4

system#0

gem5#5

system#1

single-threaded-gem5 dist-gem5parallel-gem5

Page 49: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

50

Methodology – experimental setup

▪ Focus on off-chip network performance using network intensive applications

▪ iperf, memcached, httperf, tcptest, netperf, NAS parallel benchmark

▪ Verification/validation against:

▪ Single-threaded-gem5

▪ Physical cluster

▪ 4 node cluster w/ AMD A10-5800K

▪ Speedup comparison against:

▪ Single-threaded-gem5

▪ Parallel-gem5

category gem5 configuration

O3 core 4 cores; 4 way superscalar

memory 8GB DDR3 1600 MHz

network Intel GbE NIC; 1 μs Link latency

OS Linux Ubuntu 14.04 (Kernel 4.3)

Page 50: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

51

Validation – network latency and bandwidth

▪ iperf (left) and memcahed (right)

▪ Follows the behavior of physical setup

▪ 17.5% lower response time for memcached

0.0

0.3

0.6

0.9

1.2

1.5

0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Late

ncy (

ms

)

Bandwidth (Gbps)

dist-gem5

phys

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

1 5 10 20 30 40 50 60 70 80 90 95

Late

ncy (

ms)

memcached Distribution Percentile

dist-gem5

phys

Page 51: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

52

Speedup – simulation time reduction

▪ Running httperf on each simulated node sending

fixed number of requests to a unique simulated

node (apache server)

▪ Compared with single-threaded-gem5

▪ dist-gem5 simulating 63 nodes on 16 physical

hosts is

▪ 83.1 faster than single-threaded-gem5

▪ 12.8 faster than parallel-gem5

2.76.3

21.8

36.0

83.1

2.7 3.76.6 6.0 6.5

0

10

20

30

40

50

60

70

80

90

3 7 15 31 63

Sp

ee

du

p (

No

rm.

sin

gle

-th

read

ed

-gem

5)

Number of Simulated Nodes

dist-gem5 parallel-gem5

speedup of parallel-gem5 saturates!

Page 52: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

53

Scalability – simulation time vs. simulated cluster size

▪ Simulation time increase for simulating 63 vs. 3 nodes:

▪ 57.3 for Single-threaded-gem5

▪ 23.9 for parallel-gem5

▪ 1.9 for dist-gem5

1.41.9

1.9

3.9

11.2

23.9

1.0

2.6

9.4

25.0

57.3

1.0

10.0

100.0

0 10 20 30 40 50 60 70

No

rmalized

Sim

ula

tio

n T

ime

Number of Simulated Nodes

dist-gem5 parallel-gem5 single-threaded-gem5

dist-gem5 scales well!

Page 53: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

54

Synchronization overhead

▪ Sweep synchronization quantum size

▪ # of http req remains near constants

▪ Maximum 2.6% variance

▪ Almost the same amount of work done at each

quantum size

▪ Simulation time improvement

▪ 4.9% from 0.5 μs to 1 μs

▪ 15.7% from 0.5 μs to 128 μs

0

4

8

12

16

20

0.0

0.2

0.4

0.6

0.8

1.0

1.2

0.5 1 2 4 8 16 32 64 128

Nu

mb

er

of

Req

uests

(K

Req

)

No

rma

lized

Sim

ula

tio

n T

ime

Synchronization Quantum Size (μs)

Simulation Time Req#

dist-gem5 synchronization is efficient!

Page 54: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

55

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 55: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

56

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 56: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

Getting started with gem5 FS mode

Page 57: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

58

Download and build gem5

▪ Guest architecture

▪ Several architectures in the source

tree.

▪ Most common ones are:

▪ ARM

▪ NULL – Used for trace-drive simulation

▪ X86

▪ Optimization level:

▪ debug: Debug symbols, no/few

optimizations

▪ opt: Debug symbols + most

optimizations

▪ fast: No symbols + even more

optimizations

dist-gem5 currently support ARM. We have tested x86 and the patches are on their way

Page 58: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

59

Example disk images

▪ Example kernels and disk images can be downloaded from gem5.org/Download

▪ This includes pre-compiled boot loaders

▪ Set the M5_PATH variable to point to the extracted directory

▪ Most example scripts try to find files using M5_PATH

▪ Kernels/boot loaders/device trees in ${M5_PATH}/binaries

▪ Disk images in ${M5_PATH}/disks

Page 59: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

60

Running an example script

▪ Simulates an arm64 system with 4 cores

▪ Uses a functional ‘atomic’ CPU model

▪ Runs the script on the simulated system after booting the Linux

▪ Using “init-param” you can set a parameter which is accessible from simulated system

Page 60: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

61

Sample rcS script and gem5 terminal output

gem5 terminal outputsample rcS script

Page 61: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

62

System overview

▪ gem5 will sketch the system overview for you if you install “pydot” on the host

▪ apt-get install python-pydot

Core0 Core1 Core2 Core3

MemCrtl IO Devices

Ethernet Card

Chipset

Page 62: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

Getting started with dist-gem5

Page 63: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

64

▪ before you run a dist-gem5 simulation you need:

▪ Setup passwordless ssh between the “launch host” and “simulation hosts”

▪ Set LSB_MCPU_HOSTS to map gem5 processes to simulation hosts

▪ Default will run all processes on localhost

▪ Assuming sim cluster size 4, the following will run 2 full-system gem5 processes on 10.10.10.2 and 2 on

10.10.10.3

Getting started with dist-gem5

Page 64: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

65

Running an example script

dist-gem5 launch script

Page 65: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

66

Running an example script

gem5-dist.sh options

simulated cluster sizegem5 executableswitch node config scriptfull-system nodes config script

root dir that stores logs and stats

$RUNDIRlog.switchlog.0log.1…log.(N-1)m5out.switchm5out.0…m5out.(N-1)

gem5 strerr/stdout

gem5 outdir

Page 66: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

67

Running an example script

simulated cluster sizegem5 executableswitch node config scriptfull-system nodes config script

root dir that stores logs and statsfull-system nodes arguments

gem5 binary arguments

Page 67: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

68

Sample rcS script for a dist-gem5 run

Page 68: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

69

Sample rcS script for a dist-gem5 run

assign MAC/IP addr and bring up the NIC

ping other nodes from node 0

Page 69: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

70

dist-gem5 output terminal for the example rcS script

node w/ rank 0

rank 1

rank 2

rank 3

Page 70: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

dist-gem5 launch script walk-through

Page 71: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

72

1. Launch a gem5 process simulating the network switch

2. Wait for switch process to start

3. Read dist-iface port# from log.switch

4. Start full-system gem5 processes

gem5-dist.sh script big picture

switch

ssh

sw port #

log.switch

ssh,

sw port#

Node Node Node Node

dist-etherlink

Each FS gem5 process will connect to acorresponding switch iface at the process startup

Page 72: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

73

gem5-dist.sh script walk through

Step 1

Step 2

Step 3

Page 73: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

74

gem5-dist.sh script walk through

Step 4

Page 74: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

75

▪ Default gem5-dist.sh runs identical full-system gem5 nodes

Heterogenous cluster modeling

shared for all full-system nodes

Page 75: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

76

▪ Default gem5-dist.sh runs identical full-system gem5 nodes

▪ Not desirable always

▪ We do not need to always simulate the entire cluster with high fidelity

▪ Server node with OOO and clients with atomic CPU

▪ Simulating a heterogenous cluster

▪ Nodes with different number/type of CPUs

▪ Nodes with different memory size/type

▪ Nodes with different ISAs!

▪ Modify gem5-dist.sh script to easily achieve that

▪ Let’s see how to have different arguments for node 0

Heterogenous cluster modeling

Page 76: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

77

▪ Declare a new variable for node0 arguments (“N0_ARGS”)

gem5-dist.sh changes to support heterogeneous dist runs (1)

Page 77: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

78

▪ Define a new option flag “—node0-args” and set “N0_ARGS” from command line

arguments

gem5-dist.sh changes to support heterogeneous dist runs (2)

Page 78: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

79

▪ Add “N0_ARGS” to node 0’s arguments

gem5-dist.sh changes to support heterogeneous dist runs (3)

Page 79: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

80

Example script with –node0-args

node0 is quad core and the rest

are single core

Page 80: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

81

▪ The key is gem5-dist.sh scripts

▪ You can easily extend it to have desired dist-gem5 launches

▪ Heterogenous cluster simulation

▪ Arbitrary gem5 process to physical host/core mapping

▪ …

▪ Support for simulation pool management software

▪ Instead of explicitly mapping processes to nodes, and using ssh to run gem5 processes, the cluster

management software maps and runs processes

▪ E.g. a HT-Condor version is available in https://publish.illinois.edu/icsl-pdgem5/download/

Other dist-gem5 simulation approaches

Page 81: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

dist-gem5 checkpointing/restoring

Page 82: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

83

▪ Specify the root checkpoint directory using “-c” option in gem5-dist.sh

▪ All the checkpoints would have the same dump tick

▪ You can restore by passing “—checkpoint-restore” option to all gem5 processes (full-

system processes + switch process)

Checkpointing/restoring

$CKPTDIR

m5out.switch

m5out.0

m5out.(N-1)

Stores checkpoint files for the gem5 process simulating network switch

Stores checkpoint files for the gem5 process simulating full-system node #0

Page 83: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

84

Restoring from a checkpoint

--cf-args will add its options to all the gem5

processes in the simulated cluster

(full-system processes + switch process)

Page 84: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

85

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 85: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

86

Star tree

Network topologies

physical host

top of rack

switch #0top of rack

switch #7

aggregate

switch

p8

p0 p7

p8

. . . . . .

p0 p7p1

gem5

simulated etherLink

simulated port

distEtherLink

simulated etherSwitch

p0 p7

physical host

top of rack

switch

p0 p7

. . . . . .

gem5

p0 p63

. . .

. . .

. . .

Page 86: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

87

Star network topology config script

Instantiate 64 DistEtherLink SimObjects (dist_size == 64)

Page 87: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

88

Star network topology config script

physical host

top of rack

switch

p0 p7

. . . . . .

gem5

p0 p63

Page 88: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

89

Tree topology config script

Instantiate 8 top of rack and 1 aggregate switch

Page 89: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

90

Tree topology config script

Again we need 64 DistEtherLinks

Page 90: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

91

Tree topology config script (cont.)

Instantiate 8 aggregate EtherLinks

Page 91: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

92

Tree topology config script (cont.)

Connect DistEtherLinks to top of rack switches

Page 92: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

93

Tree topology config script (cont.)

Use EtherLinks to connect aggregate and top of rack switches

Page 93: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

94

Tree topology config script (cont.)

physical host

top of rack

switch #0top of rack

switch #7

aggregate

switch

p8

p0 p7

p8

. . . . . .

p0 p7p1

gem5

p0 p7

Page 94: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

95

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 95: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

96

▪ Look into log.switch, log.0, …, log.N-1

▪ Unexpected abortion of a gem5 process

▪ Segmentation fault, panic, failed connection, …

▪ Normal exit from rcS script

▪ E.g. “info: m5 exit called with non-zero delay” message in a log file

▪ Check m5out.X/system.terminal to find out why “/sbin/m5 exit” gets called

▪ Check if there is any gem5 process running on simulation hosts

▪ Trace based debug

▪ Enable some debug flags and look into log files to get more info

▪ gem5-dist.sh:

dist-gem5 debugging checklist

Page 96: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

97

▪ GDB debugging

▪ Use “-debug” option for gem5-dist.sh

▪ Debug each gem5 binary using gbd debugger

▪ Each gem5 process will open a gdb terminal and runs from there

dist-gem5 debugging checklist (cont.)

Page 97: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

101

4 node simulated cluster debugging using gdb

Page 98: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

102

4 node simulated cluster debugging using gdb

Page 99: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

103

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 100: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

104

1. Install/build the benchmark on disk-image

▪ Mount disk-image, copy source, chroot to disk-image, build source OR

▪ Mount disk-image, chroot to disk-image, install application using “apt-get install” OR

▪ Cross compile application, mount disk-image, copy application binary to disk-image

Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5

Page 101: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

105

2. Take a checkpoint

3. Restore from checkpoint and run the benchmark

▪ Example rcS for running apache-bench; one master node and multiple slaves:

Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5

Page 102: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

106

▪ Response time of a request is ts1 – ts0

▪ Monitoring response time at software disturbs

client application

▪ Client side queueing

▪ Imprecise statistics

▪ Solution:

▪ Use m5 pseudo instruction to measure response time

Annotating benchmarks for accurate stat collection

Request Response

ResponseRequest

Network

Clients

Servers

Client Server

ts1

ts0 request

respond

Page 103: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

107

▪ Call m5_work_begin(req.id) before sending a request

▪ Call m5_work_end(req.id) when receiving a response

▪ Example output:

Annotating benchmarks for accurate stat collection

Client Server

m5_work_end(req.id)

m5_work_begin(req.id)request

respond

work_item_end reports the round trip latency

Page 104: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

108

1. Generate libm5.a for your desired ISA

▪ E.g for aarch64:

2. Copy “libm5.a” and “m5op.h” to the application’s root dir

3. Include “m5op.h” in the source code of the application

4. Add the desired m5ops to application source code

5. Add “-L. –lm5” flags to the gcc compiler and build the application

Steps to annotate an application using m5 ops

Page 105: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

109

Sample ab.c annotation with m5 ops

C->reqId is a unique ID for each request

Page 106: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

110

▪ Introduction

▪ dist-gem5 architecture

▪ Packet forwarding; Synchronization; Checkpointing; Network simulation

▪ Validation; Speedup

-- 10:15 AM to 10:45AM -- Break --

▪ Getting started with dist-gem5

▪ Prerequisites; Compiling; Running example script

▪ Launch script walk through; Checkpointing/restoring

▪ Network modeling

▪ Debugging

▪ Preparing benchmarks

▪ Apache bench

▪ Demo

Programme

Page 107: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

111

▪ If you have questions, you can post your questions in gem5 mailing lists

▪ Or contact us directly

▪ Mohammad Alian [email protected]

▪ Gabor Dozsa [email protected]

▪ Please check dist-gem5 web-page for more updates and resources for dist-gem5

▪ https://publish.illinois.edu/icsl-pdgem5/

Thank you

Page 108: dist-gem5: Distributed Simulation of Computer …publish.illinois.edu/icsl-pdgem5/files/2015/08/iiswc17-tutorial...dist-gem5: Distributed Simulation of Computer Clusters ... Network

112

Dist-gem5: Distributed Simulate of

Computer Clusters

Illinois: Mohammad Alian, Prof. Nam Sung Kim

ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap

Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA

1 Oct 2017