April 20, 2023
Introduction to SimpleScalar (Based on SimpleScalar Tutorial)
CSCE614, Texas A&M University
Overview
• What is an architectural simulator?
  – A tool that reproduces the behavior of a computing device
• Why use a simulator?
  – Leverages a faster, more flexible software development cycle
    • Permits more design space exploration
    • Facilitates validation before H/W becomes available
    • Level of abstraction can be tailored to the design task
    • Makes it possible to increase/improve system instrumentation
    • Usually less expensive than building a real system
Advantages of SimpleScalar
• Highly flexible
  – Functional simulator + performance simulator
• Portable
  – Host: virtual target runs on most Unix-like systems
  – Target: simulators can support multiple ISAs
• Extensible
  – Source is included for compiler, libraries, simulators
  – Easy to write simulators
• Performance
  – Runs codes approaching ‘real’ sizes
Simulation Tools
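[Figure: taxonomy of simulation tools. Architectural simulators divide into functional and performance simulators. Functional: trace-driven and exec-driven (interpreters, direct execution). Performance: instruction schedulers and cycle timers. Shaded tools are included in the SimpleScalar tool set.]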
Functional vs. Performance Simulators
• Functional simulators implement the architecture
  – Perform real execution
  – Implement what programmers see
• Performance simulators implement the microarchitecture
  – Model system resources/internals
  – Are concerned with time
  – Do not implement what programmers see
Trace Driven vs. Execution Driven Simulators
• Trace-Driven
  – Simulator reads a ‘trace’ of the instructions captured during a previous execution
  – Easy to implement
  – No functional components necessary
  – No feedback to the trace (e.g. mis-prediction)
• Execution-Driven
  – Simulator runs the program (trace-on-the-fly)
  – Hard to implement
  – Advantages
    • Faster than tracing
    • No need to store traces
    • Register and memory values are available (they usually are not in a trace)
    • Supports mis-speculation cost modeling
Instruction Schedulers vs. Cycle Timers
• Instruction Schedulers
  – Simulator schedules each instruction when resources are available
  – Instructions are processed one at a time
  – Simpler, but less detailed
• Cycle Timers
  – Simulator tracks microarchitecture state each cycle
  – Simulator state == microarchitecture state
  – Perfect for microarchitecture simulation
SimpleScalar Release 3.0
• SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP.
• All simulators now support external I/O traces (EIO traces), generated with a new simulator (sim-eio)
• Supports more platforms
• Explicit fault support
• And many more
Simulator Suite
Simulators ordered from highest performance to highest detail:
• Sim-Fast: 300 lines, functional, 4+ MIPS
• Sim-Safe: 350 lines, functional w/ checks
• Sim-Profile: 900 lines, functional, lots of stats
• Sim-Cache, Sim-Cheetah, Sim-BPred: < 1000 lines, functional, cache stats, branch stats
• Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS
Sim-Fast
• Functional simulation
• Optimized for speed
• Assumes no cache
• Assumes no instruction checking
• Does not support Dlite!
• Does not allow command line arguments
• <300 lines of code
Sim-Safe
• Functional simulation
• Checks for instruction errors
• Optimized for speed
• Assumes no cache
• Supports Dlite!
• Does not allow command line arguments
Sim-Cache
• Cache simulation
• Ideal for fast simulation of caches (if the effect of cache performance on execution time is not necessary)
• Accepts command line arguments for:
– level 1 & 2 instruction and data caches
– TLB configuration (data and instruction)
– Flush and compress
– and more
• Ideal for performing high-level cache studies that don’t take access time of the caches into account
Sim-Cache (cont'd)
• generates one- and two-level cache hierarchy statistics and profiles
• extra options (also supported on sim-outorder):
-cache:dl1 <config> - level 1 data cache configuration
-cache:dl2 <config> - level 2 data cache configuration
-cache:il1 <config> - level 1 instruction cache configuration
-cache:il2 <config> - level 2 instruction cache configuration
-tlb:dtlb <config> - data TLB configuration
-tlb:itlb <config> - instruction TLB configuration
-flush <config> - flush caches on system calls
-icompress - remaps 64-bit inst addresses to 32-bit equiv.
-pcstat <stat> - record statistic <stat> by text address
Specifying Cache Configurations
• all caches and TLB configurations specified with same format:
<name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
<name> - cache name (make this unique)
<nsets> - number of sets
<bsize> - block size (page size for a TLB)
<assoc> - associativity (number of “ways”)
<repl> - set replacement policy
l - for LRU
f - for FIFO
r - for RANDOM
• examples:
il1:1024:32:2:l - 2-way set-assoc 64k-byte cache, LRU
dtlb:1:4096:64:r - 64-entry fully assoc TLB w/ 4k pages, random replacement
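The two examples can be checked by hand: capacity is sets × block size × ways, so il1:1024:32:2:l gives 1024 × 32 × 2 = 65536 bytes (64k), and dtlb:1:4096:64:r gives 1 × 64 = 64 entries. The following is a minimal sketch of that arithmetic. The struct and function names are hypothetical and are not taken from the SimpleScalar sources.

```c
#include <stdio.h>

/* Hypothetical container for one <name>:<nsets>:<bsize>:<assoc>:<repl> string. */
struct cache_cfg {
    char name[32];
    int nsets;   /* number of sets */
    int bsize;   /* block size in bytes (page size for a TLB) */
    int assoc;   /* associativity (ways) */
    char repl;   /* 'l' = LRU, 'f' = FIFO, 'r' = RANDOM */
};

/* Parse a configuration string; returns 1 on success, 0 if it is malformed. */
static int parse_cache_cfg(const char *s, struct cache_cfg *c)
{
    int n = sscanf(s, "%31[^:]:%d:%d:%d:%c",
                   c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl);
    if (n != 5)
        return 0;
    if (c->nsets <= 0 || c->bsize <= 0 || c->assoc <= 0)
        return 0;
    return c->repl == 'l' || c->repl == 'f' || c->repl == 'r';
}

/* Capacity in bytes: sets x block size x ways. */
static long cache_bytes(const struct cache_cfg *c)
{
    return (long)c->nsets * c->bsize * c->assoc;
}
```

For a TLB, the number of entries is nsets × assoc, and bsize plays the role of the page size.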
Sim-Bpred
• Simulate different branch prediction mechanisms
• Generate prediction hit and miss rate reports
• Does not simulate the effect of branch prediction on total execution time
• Supported predictor types:
nottaken
taken
perfect
bimod - bimodal predictor
2lev - 2-level adaptive predictor
comb - combined predictor (bimodal and 2-level)
Sim-Profile
● Program profiler
● Generates detailed profiles, by symbol and by address
● Keeps track of and reports
  ● Dynamic instruction counts
  ● Instruction class counts
  ● Branch class counts
  ● Usage of address modes
  ● Profiles of the text & data segment
Sim-Outorder
• Most complicated and detailed simulator
• Supports out-of-order issue and execution
• Provides reports on
– branch prediction
– cache
– external memory
– various configurations
Sim-Outorder: Detailed Performance Simulator
• generates timing statistics for a detailed out-of-order issue processor core with two-level cache memory hierarchy and main memory
• extra options:
-fetch:ifqsize <size> - instruction fetch queue size (in insts)
-fetch:mplat <cycles> - extra branch mis-prediction latency (cycles)
-bpred <type> - specify the branch predictor
-decode:width <insts> - decoder bandwidth (insts/cycle)
-issue:width <insts> - RUU issue bandwidth (insts/cycle)
-issue:inorder - constrain instruction issue to program order
-issue:wrongpath - permit instruction issue after mis-speculation
-ruu:size <insts> - capacity of RUU (insts)
-lsq:size <insts> - capacity of load/store queue (insts)
-cache:dl1 <config> - level 1 data cache configuration
-cache:dl1lat <cycles> - level 1 data cache hit latency
-cache:dl2 <config> - level 2 data cache configuration
-cache:dl2lat <cycles> - level 2 data cache hit latency
-cache:il1 <config> - level 1 instruction cache configuration
-cache:il1lat <cycles> - level 1 instruction cache hit latency
-cache:il2 <config> - level 2 instruction cache configuration
-cache:il2lat <cycles> - level 2 instruction cache hit latency
-cache:flush - flush all caches on system calls
-cache:icompress - remap 64-bit inst addresses to 32-bit equiv.
-mem:lat <1st> <next> - specify memory access latency (first, rest)
-mem:width - specify width of memory bus (in bytes)
-tlb:itlb <config> - instruction TLB configuration
-tlb:dtlb <config> - data TLB configuration
-tlb:lat <cycles> - latency (in cycles) to service a TLB miss
-res:ialu - specify number of integer ALUs
-res:imult - specify number of integer multiplier/dividers
-res:memports - specify number of first-level cache ports
-res:fpalu - specify number of FP ALUs
-res:fpmult - specify number of FP multiplier/dividers
-pcstat <stat> - record statistic <stat> by text address
-ptrace <file> <range> - generate pipetrace
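The -mem:lat <1st> <next> and -mem:width options in the list above combine naturally: a cache block wider than the memory bus is moved in several bus-width chunks, the first costing <1st> cycles and each later one <next> cycles. A minimal sketch of that calculation follows. The function name and the sample numbers in the usage note are illustrative assumptions, not values taken from the slides.

```c
/* Cycles to move one cache block over the memory bus.
 * first_lat and next_lat stand for the two -mem:lat arguments,
 * bus_width for -mem:width (in bytes). Hypothetical helper. */
static int mem_block_latency(int blk_size, int bus_width,
                             int first_lat, int next_lat)
{
    /* Number of bus-width chunks needed, rounded up. */
    int chunks = (blk_size + bus_width - 1) / bus_width;

    /* The first chunk pays the full latency, the rest the shorter one. */
    return first_lat + next_lat * (chunks - 1);
}
```

For example, a 32-byte block over an 8-byte bus with latencies 18 and 2 needs 4 chunks, so 18 + 2 × 3 = 24 cycles.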
Specifying the Branch Predictor
• specifying the branch predictor type:
-bpred <type>
• the supported predictor types are:
nottaken - always predict not taken
taken - always predict taken
perfect - perfect predictor
bimod - bimodal predictor (BTB w/ 2 bit counters)
2lev - 2-level adaptive predictor
• configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
-bpred:bimod <size> - size of direct-mapped BTB
Specifying the Branch Predictor (cont'd)
• configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
-bpred:2lev <l1size> <l2size> <hist_size> <xor>
Configurations: N, M, W, X
N: # entries in first level (# of shift registers)
M: # entries in 2nd level (# of counters, or other FSM)
W: width of shift register(s) (# of bits in each shift register)
X: (yes-1/no-0) xor history and address for 2nd level index (we use 0 for this homework)
Sample predictors:
GAg: 1, M, W, 0 where M = 2^W
GAp: 1, M, W, 0 where M = C * 2^W, C is # of per-address prediction tables
PAg: N, M, W, 0 where M = 2^W
PAp: N, M, W, 0 where M = N * 2^W
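The sample-predictor rules can be turned into a small consistency check for a chosen (N, M, W) triple. This is only a sketch of the four rules as listed. The enum and function names are hypothetical, and N = 1 with M = 2^W is reported as GAg even though it also satisfies the PAp formula.

```c
/* Which sample predictor an (N, M, W) triple corresponds to. */
enum twolev_kind { TL_GAG, TL_GAP, TL_PAG, TL_PAP, TL_OTHER };

static enum twolev_kind classify_2lev(unsigned n, unsigned m, unsigned w)
{
    unsigned pht, c;

    if (n == 0 || w == 0 || w > 30)
        return TL_OTHER;

    pht = 1u << w;                 /* 2^W counters per prediction table */
    if (m == 0 || m % pht != 0)
        return TL_OTHER;           /* M must be a multiple of 2^W */

    c = m / pht;                   /* number of prediction tables */
    if (n == 1)                    /* one global history register */
        return c == 1 ? TL_GAG : TL_GAP;
    if (c == 1)                    /* per-address history, one shared table */
        return TL_PAG;
    return c == n ? TL_PAP : TL_OTHER;
}
```

For instance, "-bpred:2lev 1 2048 8 0" has M = 8 × 2^8, so it would be a GAp configuration with 8 per-address prediction tables.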
Performance Comparison of GAg, GAp, PAg and PAp
[Figure: (a) GAp; (b) (2,2) predictor. Diagram labels: branch address (4), 2-bit global branch history, 2-bits per branch predictor, prediction.]
• GAp: 1 global history register and 8 per-address prediction tables
Hack the state machine of Branch Predictor!
[Figure: state-transition diagrams of two 2-bit branch predictor state machines, with states predicting T (taken) or NT (not taken) and transitions labeled Taken / Not taken. (a) A3 (same as shown in the textbook); (b) A2 (original SimpleScalar implementation).]
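As a starting point for modifying the state machine, here is a minimal sketch of a 2-bit saturating up/down counter. It assumes that A2 denotes this common scheme; the diagram itself did not survive extraction, so that mapping is an assumption, and the function names are hypothetical rather than taken from the SimpleScalar sources.

```c
/* Counter values 0..3; values 2 and 3 predict taken. */
static int predict_taken(unsigned ctr)
{
    return ctr >= 2;
}

/* Saturating update: move one step toward the observed outcome,
 * without going below 0 or above 3. */
static unsigned update_counter(unsigned ctr, int taken)
{
    if (taken)
        return ctr < 3 ? ctr + 1 : 3;
    return ctr > 0 ? ctr - 1 : 0;
}
```

With this scheme a strongly-taken counter (3) still predicts taken after a single not-taken outcome, which is the property a different state machine such as A3 would change in its transitions.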
Sim-Outorder HW Architecture
[Figure: sim-outorder pipeline. Stages: Fetch, Dispatch, Scheduler (register scheduler and memory scheduler), Exe and Mem, Writeback, Commit. Memory system: I-Cache, I-TLB, D-Cache, D-TLB, virtual memory.]
Sim-Outorder (Main Loop)
• sim_main() in sim-outorder.c

    ruu_init();
    for (;;) {
        ruu_commit();
        ruu_writeback();
        lsq_refresh();
        ruu_issue();
        ruu_dispatch();
        ruu_fetch();
    }

• Executed once for each simulated machine cycle
• Walks pipeline from Commit to Fetch
  – Reverse traversal handles inter-stage latch synchronization in a single pass
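A toy illustration of why the reverse walk needs only one pass: if the last stage is processed first, each stage consumes its input latch before the upstream stage overwrites it, so no second copy of the latches is needed. The names below (toy_cycle, NSTAGES, EMPTY) are hypothetical and the model is far simpler than sim-outorder.c.

```c
#define NSTAGES 5      /* toy pipeline depth, illustrative only */
#define EMPTY   (-1)   /* marks a latch holding no instruction */

/* latch[i] holds the id of the instruction at the input of stage i.
 * One call models one machine cycle. */
static void toy_cycle(int latch[NSTAGES], int next_inst)
{
    /* Walk from the last stage back to the first, so every stage
     * reads its input before the earlier stage replaces it. */
    for (int s = NSTAGES - 1; s > 0; s--)
        latch[s] = latch[s - 1];

    latch[0] = next_inst;   /* the "fetch" step runs last */
}
```

Walking the stages in forward order with the same single array would instead copy the newly fetched instruction through every stage within one cycle.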
Sim-Outorder (RUU/LSQ)
• RUU (Register Update Unit)
  – Handles register synchronization/communication
  – Serves as reorder buffer and reservation stations
  – Performs out-of-order issue when register and memory dependences are satisfied
• LSQ (Load/Store Queue)
  – Handles memory synchronization/communication
  – Contains all loads and stores in program order
• Relationship between RUU and LSQ
  – Memory dependences are resolved by LSQ
  – Load/store effective address is calculated in RUU
Sim-Outorder: Fetch
● ruu_fetch()
● Models machine fetch bandwidth
● Fetches instructions from one I-cache/memory
● Blocks until I-cache misses are resolved
● Instructions are put into the instruction fetch queue named fetch_data in sim-outorder.c (also called the dispatch queue in the paper)
● Probes branch predictor to obtain the cache line for next cycle
Sim-Outorder: Dispatch
● ruu_dispatch()
● Models instruction decoding and register renaming
● Takes instructions from fetch_data
● Decodes instructions
● Enters and links instructions into RUU and LSQ
● Splits memory operations into two separate instructions
Sim-Outorder: Scheduler
● lsq_refresh()
● Models instruction selection, wakeup and issue
● Separate schedulers track register and memory dependences
● Locates instructions with all register inputs ready and all memory inputs ready
● Issue of ready loads is stalled if there is a store with unresolved effective address in LSQ
● If earlier store address matches load address, target value is forwarded to load
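The two load rules above can be sketched as a scan over the older entries of a program-ordered queue. This is a simplified illustration under stated assumptions: the entry layout and names are hypothetical, the load's own address is assumed to be known, and store-data readiness and partial overlaps are ignored.

```c
enum load_action { LOAD_STALL, LOAD_FORWARD, LOAD_ACCESS_CACHE };

/* Hypothetical, simplified LSQ entry. */
struct lsq_entry {
    int is_store;      /* 1 for a store, 0 for a load */
    int addr_known;    /* effective address already computed? */
    unsigned addr;     /* effective address, valid if addr_known */
};

/* Decide what the load at q[load_idx] may do. Entries at lower
 * indices are earlier in program order. */
static enum load_action check_load(const struct lsq_entry *q, int load_idx)
{
    enum load_action act = LOAD_ACCESS_CACHE;

    for (int i = 0; i < load_idx; i++) {
        if (!q[i].is_store)
            continue;
        if (!q[i].addr_known)
            return LOAD_STALL;      /* unresolved earlier store */
        if (q[i].addr == q[load_idx].addr)
            act = LOAD_FORWARD;     /* earlier store to the same address */
    }
    return act;
}
```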
Sim-Outorder: Execute
● ruu_issue()
● Models functional units, D-cache issue and execute latencies
● Gets instructions that are ready
● Reserves free functional unit
● Schedules writeback events using latency of the functional unit
● Latencies are hardcoded in fu_config[] in sim-outorder.c
Sim-Outorder: Writeback
● ruu_writeback()
● Models writeback bandwidth, detects mis-predictions, initiates mis-prediction recovery sequence
● Gets instructions that have finished execution (specified in event queue)
● Wakes up instructions that depend on the completed instruction, following the dependence chains of the instruction's output
● Detects branch mis-prediction and rolls state back to checkpoint
Sim-Outorder: Commit
● ruu_commit()
● Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
● While head of RUU/LSQ is ready to commit:
  ● D-TLB miss handling
  ● Retire store to D-cache
  ● Update register file and rename table
  ● Reclaim RUU/LSQ resources
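In-order retirement can be sketched with a small circular buffer: entries leave only from the head, and retirement stops at the first entry that has not finished, even if younger entries are done. The structure, the fixed size of 8 and the per-cycle width parameter are assumptions made for this illustration; store commit and D-TLB handling are left out.

```c
#define TOY_RUU_SIZE 8   /* illustrative capacity */

/* Hypothetical circular buffer standing in for the RUU. */
struct toy_ruu {
    int completed[TOY_RUU_SIZE];  /* 1 once the entry has finished */
    int head;                     /* oldest entry */
    int count;                    /* number of occupied entries */
};

/* Retire up to `width` completed entries from the head, in order.
 * Returns the number of entries retired this cycle. */
static int toy_commit(struct toy_ruu *r, int width)
{
    int retired = 0;

    while (retired < width && r->count > 0 && r->completed[r->head]) {
        r->completed[r->head] = 0;               /* reclaim the entry */
        r->head = (r->head + 1) % TOY_RUU_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```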
Sim-Outorder: Processor core and other specifications
• Instruction fetch, decode and issue bandwidth
• Capacity of RUU and LSQ
• Branch mis-prediction latency
• Number of functional units
  – integer ALU, integer multipliers/dividers
  – FP ALU, FP multipliers/dividers
• Latency of I-cache/D-cache, memory and TLB
• Record statistic by text address
Global Options
• These are supported on most simulators
-h - print help message
-d - enable debug message
-i - start up in Dlite! debugger
-q - quit immediately (use with -dumpconfig)
-config <file> - read config parameters from <file>
-dumpconfig <file> - save config parameters into <file>