Parallel SimOS: Scalability and Performance for
Large System Simulation
Ph.D. Oral Defense
Robert E. Lantz
Computer Systems Laboratory, Stanford University
Overview
• This work develops methods to simulate large computer systems with practical performance
• We use smaller machines to simulate larger machines
• We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors
Outline
• Background and Motivation
• Parallel SimOS Investigation
• Design Issues and Experiences
• Performance Evaluation
• Usability Evaluation
• Related Work
• Future Work and Conclusions
Why large systems?
• Large applications!
• Biology, Chemistry, Physics, Engineering
• From large systems (e.g. Earth’s climate) to small systems (e.g. cells, DNA)
• Web applications, search, databases
• Simulation, visualization (and games!)
Why simulate large systems?
• Compare alternative designs
• Verify a system before building it
• Predict behavior and performance
• Debug a system during bring-up
• Write software when the system is not available (or before it exists!)
• Avoid expensive mistakes
The SimOS System
• Complete machine simulator developed in CSL
• Simulates the complete hardware of a computer system: CPUs, memory, devices
• Fast and detailed enough to run a full operating system, system software, and application programs
• Multiple CPU and memory models for fast or detailed performance and behavioral modeling
[Figure: SimOS architecture. The target workload and target OS run on SimOS's simulated hardware (CPU model with processors P, memory model with memories M, and device models for disk, network, and other devices), which itself runs on the host OS and host hardware.]
Using SimOS
[Figure: SimOS inputs and outputs. Inputs: configuration/control scripts, external I/O, and a disk image holding the OS, system software, user applications, and application data. Outputs: modeled performance and event statistics, program output, and simulator statistics.]
Performance Terminology
• Execution time is the most meaningful measurement of simulator performance
• Slowdown = Real Time/Simulated Time
• Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware
• Self-relative slowdown compares a simulator with the machine it is running on
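As a worked illustration of these definitions (all timings hypothetical, not measurements from this work), a minimal C sketch:

    /* Hypothetical example of the slowdown definitions above. */
    #include <stdio.h>

    int main(void) {
        double real_time      = 600.0; /* wall-clock seconds the simulation took */
        double simulated_time =  60.0; /* virtual seconds of workload simulated  */
        double native_time    =  60.0; /* seconds the workload would need on the
                                          host machine the simulator runs on     */

        /* Slowdown = Real Time / Simulated Time */
        double slowdown = real_time / simulated_time;

        /* Self-relative slowdown compares the simulator with its own host */
        double self_relative = real_time / native_time;

        printf("slowdown = %.1fx, self-relative = %.1fx\n",
               slowdown, self_relative);
        return 0;
    }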
Speed/Detail Trade-off

SimOS CPU and Memory Models (approximate KIPS on a 225 MHz R10K):

CPU Model        Detail                                            KIPS     Self-relative slowdown
MXS              Dynamic, superscalar microarchitecture model;     12       2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system    800      300+
Embra w/caches   Single-cycle CPU model; simplified cache model    12,000   ~20
Embra            Single-cycle CPU and memory model                 25,000   ~10
Benefits of fast simulation
• Makes it possible to simulate complex workloads
• Real OS, system software, large applications
• Many billions of cycles
• Positioning before more detailed simulation
• Allows software development, debugging
• interactive usability
• Enables exploration of large design space
• Provides rough estimates of performance and trends
SimOS Applications
• Used in design, development, debugging of Stanford FLASH multiprocessor throughout its life cycle
• Enabled numerous studies of OS and application performance
• Research platform for operating systems, virtual machines, visualization
SimOS Limitations
• As we simulate larger machines, slowdown increases
[Figure: slowdown (real time/simulated time) vs. number of simulated processors (1 to 1024) for Barnes, FFT, Radix, and LU; y-axis 0 to 15,000.]
SimOS Limitations
• ...resulting in longer simulation times
[Figure: time (minutes) to simulate one minute of virtual time vs. number of simulated processors: about 10 minutes at 1 processor, 23 hours at 128, and more than 1 week at 1024.]
Problem: Simulator Slowdown
• What causes simulator slowdown?
• Intrinsic Slowdown
• Resource Exhaustion
• Linear slowdown
• Overall multiplicative slowdown:
Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
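To make the model concrete, a back-of-the-envelope sketch in C (the factor values are hypothetical; only the formula comes from the slide):

    /* Hypothetical numbers plugged into the multiplicative slowdown model. */
    #include <stdio.h>

    int main(void) {
        double workload_time = 60.0; /* one minute of virtual time             */
        double intrinsic     = 10.0; /* e.g. ~10x binary-translation overhead  */
        double exhaustion    = 40.0; /* penalty once host memory is exhausted  */
        double linear        = 64.0; /* multiplexing 64 CPUs onto one host CPU */

        double sim_time = workload_time * (intrinsic + exhaustion) * linear;

        printf("simulation time: %.0f s (about %.1f days)\n",
               sim_time, sim_time / 86400.0);
        return 0;
    }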
Solution: Parallel SimOS
• Use increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown
• Extend speed/detail trade-off with fast, parallel mode of simulation
• Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance
Outline
• Background and Motivation
• Parallel SimOS Investigation
• Design Issues and Experiences
• Embra background
• Parallel Embra Design
• Performance Evaluation
• Usability Evaluation
• Related Work
• Future Work and Conclusions
Embra: SimOS’ fastest simulation mode
• Binary translation CPU and memory simulator
• Translation Cache (TC)
• Callouts to handle events, MMU operations, exceptions, and annotations
• CPU multiplexing
• ~10x base slowdown
[Figure: Embra architecture. A Translation Cache holds kernel and user translations plus MMU/glue code, reached through a TC index; around it sit the MMU cache and handler, the decoder and translator, callout and exception handlers, event handlers, statistics reporting, and the SimOS interface.]
Embra: sources of slowdown
• Binary translation overhead
• Multiplexing overhead
• Resource Exhaustion
ST = WT * (Slowdown(I) + Slowdown(R)) * M
(simulation time = workload time * (intrinsic + resource-exhaustion slowdown) * multiplexing factor M)
Binary translation overhead
[Figure: translation path. The simulated PC is looked up in the TC index; on a miss, the decoder and translator emit host code into the Translation Cache.]

Original code in simulator memory:

    lw  r1, (r2)
    lw  r3, (r4)
    add r5, r1, r3

Translated code in the Translation Cache (TC):

    lw  SIM_T1, R2(cpu_base)
    jal mem_read_addr
    lw  SIM_T2, (SIM_T1)
    sw  SIM_T2, R1(cpu_base)
    lw  SIM_T1, R4(cpu_base)
    jal mem_read_addr
    lw  SIM_T3, (SIM_T1)
    sw  SIM_T3, R3(cpu_base)
    add SIM_T1, SIM_T2, SIM_T3
    sw  SIM_T1, R5(cpu_base)
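A minimal sketch of how such a simulator dispatches through its TC index; this is illustrative C, not Embra's source, and the names (tc_index, translate_block, and so on) are hypothetical:

    /* Illustrative translation-cache dispatch loop; hash collisions
     * and translation chaining are omitted for brevity. */
    #include <stddef.h>

    typedef void (*tc_entry_fn)(void);   /* one translated basic block */

    #define TC_INDEX_SIZE 65536
    static tc_entry_fn tc_index[TC_INDEX_SIZE];

    static struct { unsigned long next_pc; int halted; } cpu;

    /* Hypothetical translator: decode the simulated code at pc and
     * emit host instructions into the TC, returning a pointer to them. */
    extern tc_entry_fn translate_block(unsigned long pc);

    static unsigned hash_pc(unsigned long pc) {
        return (unsigned)((pc >> 2) & (TC_INDEX_SIZE - 1));
    }

    void cpu_loop(unsigned long pc) {
        while (!cpu.halted) {
            tc_entry_fn block = tc_index[hash_pc(pc)];
            if (block == NULL) {               /* TC miss: translate once */
                block = translate_block(pc);
                tc_index[hash_pc(pc)] = block;
            }
            block();            /* run translated code to the block exit */
            pc = cpu.next_pc;   /* translated code recorded the next PC  */
        }
    }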
CPU multiplexing overhead
• CPU state array
• Context switching with a variable timeslice
• large for low overhead
• small for better responsiveness
• minimal: MPinUP mode
[Figure: CPU state array. Each simulated CPU (CPU 0, CPU 1, CPU 2, ...) has an entry holding registers, FPU, MMU, and other state; host processors P multiplex across the entries.]
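A sketch of the multiplexing loop this slide implies (hypothetical code; run_cpu stands in for the TC dispatch loop above):

    /* One host thread round-robins over the simulated CPU state array,
     * running each CPU for `timeslice` simulated cycles before switching. */
    #define NCPUS 64

    struct cpu_state {
        unsigned long regs[32];    /* registers, FPU, MMU, other state */
        unsigned long pc;
        int halted;
    };
    static struct cpu_state cpus[NCPUS];

    extern void run_cpu(struct cpu_state *c, long cycles);

    void multiplex(long timeslice) {
        /* Large timeslice: low switching overhead.  Small timeslice:
         * better responsiveness.  A minimal timeslice corresponds to
         * MPinUP mode. */
        for (;;)
            for (int i = 0; i < NCPUS; i++)
                if (!cpus[i].halted)
                    run_cpu(&cpus[i], timeslice);
    }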
A new, faster mode: Parallel Embra
• Use the parallelism and memory system of a shared-memory multiprocessor
• Decimation-in-space approach
• Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion:
ST = WT * (Slowdown(I) + Slowdown(R)) * M
[Figure: simulated nodes mapped onto simulator threads.]
Design Evolution
• We started with a baseline design and evolved it to achieve scalable performance
• Baseline: thread-based parallelism, shared memory
• Critical design features:
• Mirroring hardware in software
• Replication, fine-grained parallelism
• Unsynchronized execution speed
Design: Software should mirror Hardware
• Shared Translation Cache to reduce overhead?
• Problem: contention and serialization; chaining and cache conflicts
• Fuses hardware, breaks parallelism
• Solution: mirror hardware in software with replicated Translation Caches (see the sketch below)
[Figure: Parallel Embra with the Translation Cache (kernel TC, user TC, MMU/glue code, TC index) replicated per simulator thread, alongside the MMU cache and handler, decoder and translator, callout and exception handlers, event handlers, statistics reporting, and the SimOS interface.]
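A sketch of what the replication looks like in code (hypothetical; Parallel Embra's actual structures may differ): each simulator thread owns a private TC and TC index, so the hot translate/execute path takes no shared locks:

    /* Per-thread TC replication: mirror hardware (one instruction
     * stream and cache hierarchy per CPU) in software. */
    typedef void (*tc_entry_fn)(void);

    struct tcache {
        unsigned char *code;    /* this thread's translation cache       */
        tc_entry_fn   *index;   /* this thread's PC -> translation index */
    };

    /* One private TC per simulator thread: no contention and no
     * cross-thread chaining, at the cost of translating hot code
     * once per thread. */
    static __thread struct tcache my_tc;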
Design: Software should mirror Hardware
• Shared event queue for global ordering? Events are rare!
• Problem: event frequency increases with parallelism
• Solution: replicated event queues to mirror hardware in software (a sketch follows)
[Figure: Parallel Embra architecture with replicated event queues.]
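A sketch of replicated event queues (hypothetical code): each simulated CPU keeps a private, virtual-time-ordered queue, so posting and firing events touches only per-CPU state, just as each hardware node has its own timers:

    #define NCPUS 64

    struct event {
        long long     when;          /* virtual time at which to fire */
        void        (*fire)(void *);
        void         *arg;
        struct event *next;
    };

    static struct event *queue[NCPUS];   /* one sorted list per CPU */

    /* Insert into this CPU's private queue in virtual-time order. */
    void post_event(int cpu, struct event *e) {
        struct event **p = &queue[cpu];
        while (*p && (*p)->when <= e->when)
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    /* Called from the dispatch loop: fire every event that is due. */
    void drain_events(int cpu, long long now) {
        struct event *e;
        while ((e = queue[cpu]) && e->when <= now) {
            queue[cpu] = e->next;
            e->fire(e->arg);
        }
    }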
Design: Software should mirror Hardware
• 90% of time is spent in the TC - why not parallelize the TC alone?
• Problem: Amdahl's law (see the worked example below)
• Problem: frequent callouts, contention everywhere
• Result: critical region expansion and serialization
[Figure: Parallel Embra architecture diagram.]
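The worked Amdahl's-law calculation behind that objection (the 90% figure is from the slide; the rest is arithmetic):

    /* If only the 90% of time spent in the TC is parallelized, speedup
     * on N processors is 1 / ((1 - 0.9) + 0.9/N): never more than 10x. */
    #include <stdio.h>

    int main(void) {
        double p = 0.9;                    /* parallelizable fraction */
        for (int n = 1; n <= 1024; n *= 4)
            printf("N = %4d  speedup = %5.2f\n",
                   n, 1.0 / ((1.0 - p) + p / n));
        return 0;
    }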
Critical Region Expansion
[Figure: timeline of critical regions expanding under contention and descheduling, serializing execution across threads.]
Design: Software should mirror Hardware
• Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra
• OS and applications require parallel callouts from the Translation Cache
• Parallel statistics reporting is also a good idea, but happens infrequently
[Figure: Parallel Embra architecture with fine-grained parallelism in each component.]
Design: flexible virtual time synchronization
• Problem: cycle skew between fast and slow processors
• Solution: configurable barrier synchronization (sketched below)
• fast processors wait for slow processors
• fine grain (like MPinUP mode)
• loose grain (reduces sync overhead)
• variable interval for flexibility
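A sketch of the configurable barrier (hypothetical code, using POSIX barriers):

    /* Every sync_interval simulated cycles, each simulator thread waits
     * for all the others, bounding cycle skew between fast and slow
     * simulated CPUs. */
    #include <pthread.h>

    extern void run_cpu_cycles(int cpu, long cycles); /* TC dispatch loop */

    static pthread_barrier_t time_barrier;  /* initialized elsewhere with
                                               the number of threads */
    static long sync_interval;  /* small: tight skew (like MPinUP mode);
                                   large: low synchronization overhead  */

    void *cpu_thread(void *arg) {
        int cpu = *(int *)arg;
        for (;;) {
            run_cpu_cycles(cpu, sync_interval);  /* advance virtual time */
            pthread_barrier_wait(&time_barrier); /* fast waits for slow  */
        }
        return NULL;   /* not reached */
    }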
Design: synchronization causes slowdown
[Figure: 32-processor slowdown relative to a large synchronization interval, vs. synchronization interval (500,000; 1,000,000; 10,000,000 cycles) for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water; y-axis 0 to 4.]
Design: unsynchronized execution
• For performance, the best synchronization interval is longer than the workload, i.e. never synchronize
• We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew
• This is because every thread sees a consistent ordering of memory and synchronization events
Design conclusions
• Parallelism increases contention for: callouts, event system, TC, clock, MMU, interrupt controllers, any shared subsystem
• Contention cascades, resulting in critical region expansion and serialization
• Mirroring hardware in software preserves parallelism, avoids contention effects
• Fine-grained synchronization is required to permit correct and highly parallel access to simulator data
• Time synchronization across processors is unnecessary for correctness and undesirable for speed
• Performance depends on combination of all parallel performance features
Outline
• Background and Motivation
• Parallel SimOS Investigation
• Design Issues and Experiences
• Performance Evaluation
• Usability Evaluation
• Related Work
• Future Work
• Conclusions
Performance: Test Configuration

Workload:

Benchmark   Description
Barnes      Hierarchical Barnes-Hut method for the N-body problem
FFT         Fast Fourier Transform
LU          Lower/upper matrix factorization
MP3D        Particle-based hypersonic wind tunnel simulation
Radix       Integer radix sort
Raytrace    Ray tracer
Ocean       Ocean currents simulation
Water       Water molecule simulation
pmake       Compile phase of the Modified Andrew Benchmark
ptest       Simple benchmark for sanity check/peak performance

Machine: Stanford FLASH multiprocessor, 64 nodes, 225 MHz MIPS R10000, 220 MB DRAM per node (14 GB total); configurations flash1, flash32, flash64, etc.
Performance: Peak and actual MIPS
[Figure: simulated MIPS over time on Flash32 for ptest (peaking near 1600 MIPS) and the SPLASH-2 suite (near 1000 MIPS).]
Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware
Performance: Hardware self-relative slowdown
[Figure: self-relative slowdown vs. simulated machine size (1 to 64 processors) for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big; y-axis 0 to 60.]
~10x slowdown regardless of machine size
Performance: benchmark phases
[Figures: phase behavior of Barnes and LU on Flash32, and of MP3D on Flash32.]
Large Scale Performance
• 1024-processor simulation: a 16 x 64-processor cluster, simulated 8-way parallel
Large Scale Performance

Slowdown (real time/simulated time):

Workload        SimOS    Parallel SimOS
Radix/Flash32   9,409    442
LU/Flash64      10,323   772

Hours or days rather than weeks
Speed/Detail Trade-off, revisited

Parallel SimOS CPU and Memory Models (approximate KIPS on a 225 MHz R10K):

CPU Model        Detail                                            KIPS          Self-relative slowdown
MXS              Dynamic, superscalar microarchitecture model;     12            2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system    800           300+
Embra w/caches   Single-cycle CPU model; simplified cache model    12,000        ~20
Embra            Single-cycle CPU and memory model                 25,000        ~10
Parallel Embra   Non-deterministic, single-cycle CPU and           > 1,000,000   ~10
                 memory model
Performance Conclusions
• Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS
• Parallel SimOS simulates a multiprocessor with performance comparable to serial SimOS simulating a uniprocessor
• Parallel SimOS extends scalability of complete machine simulation to 1024 processor systems
Usability Study
• Study of a large, complex parallel program: Parallel SimOS itself
• Self-hosting capability of orthogonal simulators
• Performance debugging of Parallel SimOS, and a test of functionality and usability
• Self-hosting architecture:
[Figure: self-hosting stack, bottom to top: hardware (SGI Origin), Irix 6.5, outer SimOS, outer Irix 6.5, inner SimOS, inner Irix 6.5, benchmark (Radix).]
Phase profile
[Figures: computation intervals (CPU vs. time, 0 to 4 s) for self-hosted Radix under serial SimOS and under Parallel SimOS.]
Bugs found: excessive TLB misses, interrupt storms
Limitation: system imbalance effects
Usability Conclusions
• Parallel SimOS worked correctly on itself
• Revealed bugs and limitations of Parallel SimOS
• Speed/detail trade-off enabled with checkpoints
• Detailed mode too slow - ended up scaling down workload
• Need for faster detailed simulation modes
Limitations
• Virtual time depends on real time
• Loss of determinism, repeatability
• but can use checkpoints!
• System Imbalance Effects
• Memory Limits
• Need for fast detailed mode
• future work
Related Work
• Parallel SimOS uses shared-memory multiprocessors and decimation in space
• Other approaches to improving performance using parallelism include:
• Decimation in time
• Cluster-based simulation
Related Work: Decimation in Time
ST = WT * (Slowdown(I) + Slowdown(R)) * N
[Figure: an initial fast serial execution writes checkpoints that divide the workload into segments 1-4; a subsequent parallel execution replays all segments concurrently, overlapping work that serial reconstruction would do one segment at a time.]
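A sketch of the segment-parallel structure (hypothetical code; restore_checkpoint and simulate_segment are stand-ins, and the fast serial pass that wrote the checkpoints is assumed to have already run):

    #include <sys/wait.h>
    #include <unistd.h>

    #define NSEGMENTS 4

    extern void restore_checkpoint(int segment); /* load saved machine state   */
    extern void simulate_segment(int segment);   /* detailed replay of segment */

    int main(void) {
        /* Replay every segment concurrently, one child process each. */
        for (int s = 0; s < NSEGMENTS; s++) {
            if (fork() == 0) {
                restore_checkpoint(s);
                simulate_segment(s);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)   /* join all segment replays */
            ;
        return 0;
    }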
Parallel SimOS: Decimation in Space
[Figure: simulated nodes partitioned across simulator threads.]
ST = WT * (Slowdown(I) + Slowdown(R)) * M
Related Work: Cluster-based Simulation
• Most common means of parallel simulation: Shaman, BigSim, others
• Fast (?) LAN = high-latency communication
• Software-based shared memory = low performance
• Reduced flexibility
[Figure: cluster of hosts connected by a switch.]
Parallel SimOS: Flexible Simulation
• Tightly and loosely coupled machines
• From clusters to multiprocessors and everything in between
• Parallelism across multiprocessor nodes
[Figure: target architectures. A workstation cluster ("Sweet Hall") joined by a network; a NUMA shared-memory multiprocessor (the Stanford FLASH machine) whose nodes of CPUs, caches, and memory controllers share a multi-level bus/interconnect; and a multiprocessor cluster in which multiprocessors with network interfaces communicate over a network.]
Related Work Summary
• Decimation in time achieves good speedup at the expense of interactivity
• synergistic with Parallel SimOS
• Cluster-based simulation addresses the needs of loosely-coupled systems, generally without shared memory
• The Parallel SimOS approach achieves programmability and performance for a larger design space that includes tightly-coupled and hybrid systems
Future Work
• Faster detailed simulation
• Parallel detailed mode with flexible memory, pipeline models
• Try to recapture determinism
• Global memory ordering in virtual time
• Faster less-detailed simulation
• Revisit direct execution, using virtual machine monitors, user-mode OS, etc.
Conclusion: Thesis Contributions
• Developed design and implementation of scalable, parallel complete machine simulation
• Eliminated slowdown due to resource exhaustion and multiplexing
• Scaled complete machine simulation up by an order of magnitude - 1024 processor machines on our hardware
• Developed flexible simulator capable of simulating large, tightly-coupled systems with interactive performance