
Parallel SimOS: Scalability and Performance for Large System Simulation

Ph.D. Oral Defense
Robert E. Lantz

Computer Systems Laboratory, Stanford University

Overview

• This work develops methods to simulate large computer systems with practical performance

• We use smaller machines to simulate larger machines

• We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work and Conclusions


Why large systems?

• Large applications!

• Biology, Chemistry, Physics, Engineering

• From large systems (e.g. Earth’s climate) to small systems (e.g. cells, DNA)

• Web applications, search, databases

• Simulation, visualization (and games!)


Why simulate large systems?

• Compare alternative designs

• Verify a system before building it

• Predict behavior and performance

• Debug a system during bring-up

• Write software when the system is not available (or before it exists!)

• Avoid expensive mistakes


The SimOS System

• Complete machine simulator developed in CSL

• Simulates the complete hardware of a computer system: CPU, memory, devices

• Enough speed and detail to run a full operating system, system software, and application programs

• Multiple CPU and memory models for fast or detailed performance and behavioral modeling

[Figure: SimOS architecture. The target workload and target OS run on SimOS-simulated hardware (a CPU model, a memory model, and device models for disk, network, and other devices), which in turn runs on the host OS and host hardware]

Using SimOS

[Figure: Using SimOS. Inputs: configuration/control scripts, external I/O, and a disk image containing the OS, system software, user applications, and application data. Outputs: program output, modeled performance and event statistics, and simulator statistics]

Performance Terminology

• Execution time is the most meaningful measurement of simulator performance

• Slowdown = Real Time/Simulated Time

• Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware

• Self-relative slowdown compares a simulator with the machine it is running on
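As a concrete illustration (using figures consistent with the limitations slides later in this talk): if one minute of virtual time takes roughly a week of real time to simulate, then

\[ \text{Slowdown} = \frac{\text{Real Time}}{\text{Simulated Time}} \approx \frac{10{,}000\ \text{minutes}}{1\ \text{minute}} \approx 10{,}000\times \]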


Speed/Detail Trade-off

SimOS CPU and Memory Models (approximate KIPS on a 225 MHz R10000; self-relative slowdown)

Model            Detail                                               KIPS     Slowdown
MXS              Dynamic, superscalar microarchitecture model;        12       2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system       800      300+
Embra w/caches   Single-cycle CPU model; simplified cache model       12,000   ~20
Embra            Single-cycle CPU and memory model                    25,000   ~10

Benefits of fast simulation

• Makes it possible to simulate complex workloads

• Real OS, system software, large applications

• Many billions of cycles

• Positioning before more detailed simulation

• Allows software development, debugging

• interactive usability

• Enables exploration of large design space

• Provides rough estimates of performance and trends

SimOS Applications

• Used in design, development, debugging of Stanford FLASH multiprocessor throughout its life cycle

• Enabled numerous studies of OS and application performance

• Research platform for operating systems, virtual machines, visualization


SimOS Limitations

• As we simulate larger machines, slowdown increases

[Chart: slowdown (real time / simulated time) vs. simulated processors, from 1 to 1024, for Barnes, FFT, Radix, and LU; slowdown climbs into the thousands as machine size grows]

SimOS Limitations

• ...resulting in longer simulation times

[Chart: time (in minutes) to simulate one minute of virtual time vs. simulated processors: roughly 10 minutes at 1 processor, about 23 hours at 128 processors, and more than a week at 1024 processors]

Problem: Simulator Slowdown

• What causes simulator slowdown?

• Intrinsic Slowdown

• Resource Exhaustion

• Linear slowdown

• Overall multiplicative slowdown:

Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
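In the symbolic form used on later slides, with ST the simulation (real) time, WT the workload (virtual) time, Slowdown(I) the intrinsic slowdown, Slowdown(R) the resource exhaustion penalty, and M the linear (multiplexing) slowdown:

\[ ST = WT \times \bigl(\mathit{Slowdown}(I) + \mathit{Slowdown}(R)\bigr) \times M \]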


Solution: Parallel SimOS

• Use increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown

• Extend speed/detail trade-off with fast, parallel mode of simulation

• Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Embra background

• Parallel Embra Design

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work and Conclusions


Embra: SimOS’ fastest simulation mode

• Binary translation CPU and memory simulator

• Translation Cache (TC)

• Callouts to handle events, MMU operations, exceptions and annotations

• CPU multiplexing

• ~10x base slowdown

[Figure: Embra block diagram. The Translation Cache (kernel TC, user TC, and MMU/glue code) with its TC index, plus the MMU cache and MMU handler, decoder and translator, callout and exception handlers, event handlers, statistics reporting, and the SimOS interface]
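To make the control flow concrete, here is a minimal, hypothetical sketch of an Embra-style dispatch loop (illustrative names only, not Embra's actual code): look up the current simulated PC in the TC index, invoke the decoder/translator on a miss, then run the cached translation, which calls out for MMU operations, exceptions, and events.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch of a binary-translation dispatch loop (not Embra's actual code). */
    typedef struct CPUState CPUState;
    typedef void (*TransFn)(CPUState *cpu);      /* cached translation of one basic block */

    struct CPUState {
        uint64_t pc;         /* simulated program counter */
        uint64_t regs[32];   /* simulated integer registers */
        void    *tc_index;   /* Translation Cache index for this CPU */
    };

    /* Assumed helpers, standing in for the TC index, translator, and event system. */
    extern TransFn tc_lookup(void *tc_index, uint64_t pc);
    extern TransFn translate_block(CPUState *cpu, uint64_t pc);
    extern void    tc_insert(void *tc_index, uint64_t pc, TransFn fn);
    extern void    handle_pending_events(CPUState *cpu);

    void run_cpu(CPUState *cpu)
    {
        for (;;) {
            TransFn fn = tc_lookup(cpu->tc_index, cpu->pc);   /* hit in the TC? */
            if (fn == NULL) {                                 /* miss: translate and cache */
                fn = translate_block(cpu, cpu->pc);
                tc_insert(cpu->tc_index, cpu->pc, fn);
            }
            fn(cpu);                       /* run translated code; updates pc and registers */
            handle_pending_events(cpu);    /* callouts for MMU faults, devices, interrupts */
        }
    }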

Embra: sources of slowdown

• Binary translation overhead

• Multiplexing overhead

• Resource Exhaustion

ST = WT * (Slowdown(I) + Slowdown(R)) * M


Binary translation overhead

[Figure: the decoder and translator map each simulated basic block, indexed by PC, into the Translation Cache in simulator memory]

Original code:

    lw r1, (r2)
    lw r3, (r4)
    add r5, r1, r3

Translated code:

    lw SIM_T1, R2(cpu_base)
    jal mem_read_addr
    lw SIM_T2, (SIM_T1)
    sw SIM_T2, R1(cpu_base)
    lw SIM_T1, R4(cpu_base)
    jal mem_read_addr
    lw SIM_T3, (SIM_T1)
    sw SIM_T3, R3(cpu_base)
    add.w SIM_T1, SIM_T2, SIM_T3
    sw SIM_T1, R5(cpu_base)

CPU multiplexing overhead

• CPU State array

• Context switching with variable timeslice

• large for low overhead

• small for better responsiveness

• minimal: MPinUP mode

[Figure: a physical processor multiplexes among an array of per-CPU state blocks (registers, FPU, MMU, and other state) for CPU 0, CPU 1, CPU 2, ...]
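A minimal sketch of this multiplexing, with hypothetical helper names: each host processor round-robins over its simulated CPUs, running each for a configurable timeslice of simulated cycles.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of CPU multiplexing with a variable timeslice (illustrative only). */
    typedef struct {
        uint64_t pc;
        uint64_t regs[32];
        /* ... FPU, MMU, and other per-CPU state ... */
    } SimCPU;

    /* Assumed: simulates one CPU for up to 'cycles' simulated cycles, then returns. */
    extern void run_cpu_for(SimCPU *cpu, uint64_t cycles);

    void multiplex(SimCPU *cpus, size_t ncpus, uint64_t timeslice)
    {
        /* A large timeslice amortizes switching overhead; a small one improves
         * responsiveness; MPinUP mode corresponds to a minimal timeslice. */
        for (;;)
            for (size_t i = 0; i < ncpus; i++)
                run_cpu_for(&cpus[i], timeslice);   /* switch to the next simulated CPU */
    }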

A new, faster mode: Parallel Embra

• Use parallelism and memory system of shared-memory multiprocessor

• Decimation-in-space approach

• Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion:

ST = WT * (Slowdown(I) + Slowdown(R)) * M

[Figure: decimation in space: the simulated nodes are partitioned among simulator threads]
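A minimal sketch of decimation in space, using POSIX threads and hypothetical names: partition the simulated machine into nodes and give each node its own simulator thread, so simulation work proceeds in parallel on the host multiprocessor.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical sketch: one simulator thread per simulated node (illustrative only). */
    typedef struct SimCPU SimCPU;            /* per-CPU state, as in the earlier sketch */

    typedef struct {
        SimCPU **cpus;                       /* simulated CPUs assigned to this node */
        size_t   ncpus;
    } SimNode;

    extern void run_cpu_for(SimCPU *cpu, unsigned long cycles);

    static void *simulate_node(void *arg)
    {
        SimNode *node = arg;
        for (;;)                             /* each thread simulates only its own node */
            for (size_t i = 0; i < node->ncpus; i++)
                run_cpu_for(node->cpus[i], 100000);
        return NULL;
    }

    void start_parallel_simulation(SimNode *nodes, size_t nnodes)
    {
        for (size_t i = 0; i < nnodes; i++) {    /* one host thread per simulated node */
            pthread_t tid;
            pthread_create(&tid, NULL, simulate_node, &nodes[i]);
        }
    }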

Design Evolution

• We started with a baseline design and evolved it to achieve scalable performance

• Baseline: thread-based parallelism, shared memory

• Critical design features:

• Mirroring hardware in software

• Replication, fine-grained parallelism

• Unsynchronized execution speed


Design: Software should mirror Hardware

• Shared Translation Cache to reduce overhead?

• Problem: contention and serialization; chaining and cache conflicts

• Fuses hardware, breaks parallelism

• Solution: mirror hardware in software with replicated Translation Caches

[Figure: Parallel Embra block diagram. The Translation Cache (kernel TC, user TC, MMU/glue code, and TC index) is replicated per simulator thread, alongside the MMU cache and handler, decoder and translator, callout and exception handlers, event handlers, statistics reporting, and the SimOS interface]
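As a sketch of what replication means here (hypothetical names and sizes): every simulator thread owns a private Translation Cache and index, so translation lookups and insertions never contend or serialize.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of replicated, per-thread Translation Caches (illustrative only). */
    enum { TC_BYTES = 1 << 22, TC_INDEX_ENTRIES = 1 << 16 };

    typedef struct {
        uint8_t  code[TC_BYTES];            /* this thread's private translated-code region */
        size_t   used;                      /* bytes of translated code emitted so far      */
        void    *index[TC_INDEX_ENTRIES];   /* private PC -> translation map, no locking    */
    } TranslationCache;

    typedef struct {
        TranslationCache kernel_tc;         /* one kernel TC and one user TC per thread,    */
        TranslationCache user_tc;           /* mirroring the per-processor hardware         */
    } ThreadState;

    /* Lookups touch only thread-local state, so no lock is needed. */
    void *tc_lookup_local(ThreadState *ts, uint64_t pc, int in_kernel)
    {
        TranslationCache *tc = in_kernel ? &ts->kernel_tc : &ts->user_tc;
        return tc->index[(pc >> 2) & (TC_INDEX_ENTRIES - 1)];
    }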


Design: Software should mirror Hardware

• Shared Event Queue for global ordering? Events are rare!

• Problem: event frequency increases with parallelism

• Solution: replicated event queues to mirror hardware in software
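As a sketch of the replicated event queues (hypothetical names): each simulated CPU owns an event queue ordered by its own virtual time, so posting and dispatching events stays local to the owning simulator thread.

    #include <stdint.h>

    /* Hypothetical sketch of per-CPU event queues (illustrative only). */
    typedef struct Event {
        uint64_t      when;            /* virtual time (cycles) at which the event fires */
        void        (*fire)(void *);   /* callback: device interrupt, timer expiry, ...  */
        void         *arg;
        struct Event *next;
    } Event;

    typedef struct {
        uint64_t now;                  /* this CPU's virtual time     */
        Event   *head;                 /* sorted list, earliest first */
    } EventQueue;                      /* one queue per simulated CPU */

    void post_event(EventQueue *q, Event *e)
    {
        Event **p = &q->head;
        while (*p && (*p)->when <= e->when)     /* insert in virtual-time order */
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    void dispatch_due_events(EventQueue *q)
    {
        while (q->head && q->head->when <= q->now) {
            Event *e = q->head;
            q->head = e->next;
            e->fire(e->arg);                    /* handled by the owning thread */
        }
    }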


Design: Software should mirror Hardware

• 90% of time is spent in the TC, so why not parallelize only the TC?

• Problem: Amdahl’s law

• Problem: frequent callouts, contention everywhere

• Result: critical region expansion and serialization


Critical Region Expansion

[Figure: timeline showing critical regions expanding over time; contention and descheduling lead to expansion and serialization]

Design: Software should mirror Hardware

• Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra

• OS and apps require parallel callouts from Translation Cache

• Parallel statistics reporting is also a good idea, but happens infrequently


Design: flexible virtual time synchronization

• Problem: cycle skew between fast, slow processors

• Solution: configurable barrier synchronization

• fast processors wait for slow processors

• fine-grain (like MPinUP mode)

• loose grain (reduce sync overhead)

• variable interval for flexibility
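A minimal sketch of the configurable barrier, using POSIX threads and hypothetical names: each simulator thread advances its CPUs by at most one synchronization interval of virtual time and then waits for the others, so fast processors wait for slow ones; the interval can be fine (MPinUP-like), coarse, or effectively infinite.

    #include <pthread.h>
    #include <stdint.h>

    /* Hypothetical sketch of configurable virtual-time synchronization (illustrative only). */
    typedef struct SimCPU SimCPU;
    extern void run_cpu_for(SimCPU *cpu, uint64_t cycles);

    static pthread_barrier_t time_barrier;   /* initialized with the number of simulator threads */

    void simulate_with_sync(SimCPU *cpu, uint64_t sync_interval)
    {
        for (;;) {
            run_cpu_for(cpu, sync_interval);       /* run ahead by at most one interval    */
            pthread_barrier_wait(&time_barrier);   /* wait for slower simulated processors */
        }
    }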


Design: synchronization causes slowdown

[Chart: 32-processor slowdown relative to a large synchronization interval, for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water, at synchronization intervals of 500,000, 1,000,000, and 10,000,000 cycles; tighter intervals increase slowdown by up to several times]

Design: unsynchronized execution

• For performance, the best synchronization interval is longer than the workload, i.e. never synchronize

• We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew

• This is because every thread sees a consistent ordering of memory and synchronization events


Design conclusions

• Parallelism increases contention for: callouts, event system, TC, clock, MMU, interrupt controllers, any shared subsystem

• Contention cascades, resulting in critical region expansion and serialization

• Mirroring hardware in software preserves parallelism, avoids contention effects

• Fine-grained synchronization is required to permit correct and highly parallel access to simulator data

• Time synchronization across processors is unnecessary for correctness and undesirable for speed

• Performance depends on combination of all parallel performance features


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work

• Conclusions


Performance: Test Configuration

Workload (benchmarks):

Barnes     Hierarchical Barnes-Hut method for the N-body problem
FFT        Fast Fourier Transform
LU         Lower/upper matrix factorization
MP3D       Particle-based hypersonic wind tunnel simulation
Radix      Integer radix sort
Raytrace   Ray tracer
Ocean      Ocean currents simulation
Water      Water molecule simulation
pmake      Compile phase of the Modified Andrew Benchmark
ptest      Simple benchmark for sanity check / peak performance

Machine: Stanford FLASH Multiprocessor, 64 nodes, MIPS R10000 at 225 MHz, 220 MB DRAM per node (14 GB total); configurations flash1, flash32, flash64, etc.

Performance: Peak and actual MIPS

[Chart: simulated MIPS over time on Flash32 for ptest (peaks around 1600 MIPS) and the SPLASH-2 suite (around 1000 MIPS)]

Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware

Performance: Hardware self-relative slowdown

[Chart: self-relative slowdown vs. simulated machine size (1 to 64 processors) for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big]

~10x slowdown regardless of machine size

Performance: benchmark phases

[Figures: phase behavior of Barnes and LU on Flash32]

[Figure: phase behavior of MP3D on Flash32]

Large Scale Performance

• 1024-processor simulation: 16 x 64-processor cluster, simulated 8-way parallel

Large Scale Performance

[Chart: slowdown (real time / simulated time), serial SimOS vs. Parallel SimOS. Radix/Flash32: 9,409 under SimOS vs. 442 under Parallel SimOS; LU/Flash64: 10,323 vs. 772]

Hours or days rather than weeks

Speed/Detail Trade-off, revisited

Parallel SimOS CPU and Memory Models (approximate KIPS on a 225 MHz R10000; self-relative slowdown)

Model            Detail                                                 KIPS          Slowdown
MXS              Dynamic, superscalar microarchitecture model;          12            2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system         800           300+
Embra w/caches   Single-cycle CPU model; simplified cache model         12,000        ~20
Embra            Single-cycle CPU and memory model                      25,000        ~10
Parallel Embra   Non-deterministic, single-cycle CPU and memory model   > 1,000,000   ~10

Performance Conclusions

• Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS

• Parallel SimOS simulates a multiprocessor with performance comparable to serial SimOS simulating a uniprocessor

• Parallel SimOS extends the scalability of complete machine simulation to 1024-processor systems


Usability Study

• Study of a large, complex parallel program: Parallel SimOS itself

• Self-hosting capability of orthogonal simulators

• Performance debugging of Parallel SimOS, and a test of functionality and usability

• Self-hosting architecture:

[Figure: self-hosting stack. The benchmark (Radix) runs on the inner Irix 6.5 on the inner SimOS, which runs on the outer Irix 6.5 on the outer SimOS, which runs on Irix 6.5 on the hardware (an SGI Origin)]

Phase profile

[Figure: computation intervals for self-hosted Radix, per CPU over time (seconds), under serial SimOS and Parallel SimOS]

Bugs found: excessive TLB misses, interrupt storms
Limitation: system imbalance effects

Usability Conclusions

• Parallel SimOS worked correctly on itself

• Revealed bugs and limitations of Parallel SimOS

• Speed/detail trade-off enabled with checkpoints

• Detailed mode too slow - ended up scaling down workload

• Need for faster detailed simulation modes


Limitations

• Virtual time depends on real time

• Loss of determinism, repeatability

• but can use checkpoints!

• System Imbalance Effects

• Memory Limits

• Need for fast detailed mode

• future work


Related Work

• Parallel SimOS uses shared-memory multiprocessors and decimation in space

• Other approaches to improving performance using parallelism include:

• Decimation in time

• Cluster-based simulation


Related Work: Decimation in Time

ST = WT * (Slowdown(I) + Slowdown(R)) * N

[Figure: decimation in time. An initial serial execution of segments 1 through 4 takes checkpoints at segment boundaries; the segments are then re-executed in parallel from those checkpoints and serially reconstructed, with overlap between adjacent segments]

Parallel SimOS: Decimation in Space

[Figure: decimation in space: simulated nodes are partitioned among simulator threads]

ST = WT * (Slowdown(I) + Slowdown(R)) * M

Related Work: Cluster-based Simulation

• The most common means of parallel simulation: Shaman, BigSim, and others

• Fast (?) LAN = high-latency communication

• Software-based shared memory = low performance

• Reduced flexibility

[Figure: cluster of workstations connected by a switch]

Parallel SimOS: Flexible Simulation

• Tightly and loosely coupled machines

• From clusters to multiprocessors and everything in between

• Parallelism across multiprocessor nodes

[Figures: target architectures, from a workstation cluster ("Sweet Hall": nodes with CPUs, caches, and memory controllers connected by a network), to a NUMA shared-memory multiprocessor (the Stanford FLASH machine: nodes on a multi-level bus/interconnect), to a cluster of multiprocessors joined by network interfaces]

Related Work Summary

• Decimation in time achieves good speedup at the expense of interactivity

• synergistic with Parallel SimOS

• Cluster-based simulation addresses needs of loosely-coupled systems, generally without shared memory

• The Parallel SimOS approach achieves both programmability and performance for a larger design space that includes tightly coupled and hybrid systems


Future Work

• Faster detailed simulation

• Parallel detailed mode with flexible memory, pipeline models

• Try to recapture determinism

• Global memory ordering in virtual time

• Faster less-detailed simulation

• Revisit direct execution, using virtual machine monitors, user-mode OS, etc.


Conclusion: Thesis Contributions

• Developed design and implementation of scalable, parallel complete machine simulation

• Eliminated slowdown due to resource exhaustion and multiplexing

• Scaled complete machine simulation up by an order of magnitude, to 1024-processor machines on our hardware

• Developed flexible simulator capable of simulating large, tightly-coupled systems with interactive performance
