Reducing Issue Logic Complexity in Superscalar Microprocessors
Survey Project, CprE 585 – Advanced Computer Architecture
David Lastine, Ganesh Subramanian
Introduction
The ultimate goal of any computer architect – designing a fast machine
Approaches:
- Increasing the clock rate (with help from VLSI)
- Increasing bus width
- Increasing pipeline depth
- Superscalar architectures
Tradeoff between hardware complexity and clock speed:
- For a given technology, the more complex the hardware, the lower the achievable clock rate
A New Paradigm
- Retain the effective functionality of complex superscalar processors
- Target the bottleneck in present-day microprocessors: instruction scheduling is the throughput limiter
- Need to effectively handle register renaming, the issue window, and the wakeup/select logic
- Increase the clock rate by rethinking circuit design methodologies and modifying architectural design strategies
- Wanting to have the cake and eat it too? Aim at reducing power consumption as well
Approaches to Handle Issue Logic Complexity
Performance = IPC * Clock Frequency
- Pipelining the scheduling logic reduces IPC
- Non-pipelined scheduling logic reduces the clock rate
Architectural solutions:
- Non-pipelined scheduling with dependence-queue-based issue logic – Complexity-Effective [1]
- Pipelined scheduling with speculative wakeup [2]
- Generic speedup and power conservation using tag elimination [3]
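The Performance = IPC * Clock Frequency tradeoff above can be made concrete with a toy calculation. All numbers below are hypothetical, chosen only to illustrate why a small IPC loss can be worth a clock-rate gain:

```python
# Toy illustration of Performance = IPC * Clock Frequency.
# The IPC and frequency values are made up for illustration only.

def performance(ipc, freq_ghz):
    """Billions of instructions committed per second."""
    return ipc * freq_ghz

# Non-pipelined scheduling: higher IPC, but the long wakeup+select
# path limits the achievable clock rate.
non_pipelined = performance(ipc=2.0, freq_ghz=1.0)

# Pipelined scheduling: the scheduling loop spans two cycles, which
# costs some IPC but permits a faster clock.
pipelined = performance(ipc=1.8, freq_ghz=1.25)

print(non_pipelined, pipelined)
```

With these (invented) numbers the pipelined design wins overall despite its lower IPC, which is exactly the balance the three papers below explore.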
Baseline Superscalar Model
- The rename and wakeup/select stages of the generic superscalar pipeline model are the targets
- Consider VLSI effects when deciding which design component to redesign
Analyzing Baseline Implementations
Physical-layout implementations of microprocessor circuits optimized for speed:
- Dynamic logic for bottleneck circuits
- Manual sizing of transistors in the critical path
- Logic optimizations such as two-level decomposition
Components analyzed:
- Register rename logic
- Wakeup logic / issue window
- Selection logic
- Bypass logic
Register Rename Logic
- RAM vs. CAM schemes; focus on the RAM scheme due to scalability
- Decreasing feature sizes scale down logic delays, but wire delays do not scale correspondingly
- Delay grows quadratically with issue width, but is effectively linear over the design space considered
- Wordline and bitline delays will need attention in the future
Wakeup Logic
- CAM is preferred
- Tag drive times are quadratic functions of both window size and issue width
- Matching times are quadratic functions of issue width only
- All delays are effectively linear over the considered design space
- Broadcast operation delays will need attention in the future
Selection Logic
- Tree of arbiters: requests flow down while functional-unit grants flow back up to the issue window
- A selection policy is necessary (oldest first / leftmost first)
- Delays are proportional to the logarithm of the window size
- All delays considered are logic delays
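The tree-of-arbiters structure can be sketched in a few lines. This is a behavioral model, not a circuit: each arbiter node chooses between its two halves, so the grant emerges after a number of arbiter levels equal to log2 of the window size, matching the logarithmic delay noted above. The leftmost-first policy is assumed here:

```python
# Behavioral sketch of a tree of arbiters with a leftmost-first policy.
# The recursion depth models the number of arbiter levels the request
# and grant signals traverse: log2(window_size).

import math

def select_leftmost(requests):
    """Return the index of the leftmost requesting entry, or None."""
    def arbiter(lo, hi):
        if hi - lo == 1:                      # leaf: a single window entry
            return lo if requests[lo] else None
        mid = (lo + hi) // 2
        left = arbiter(lo, mid)               # left subtree has priority
        return left if left is not None else arbiter(mid, hi)
    return arbiter(0, len(requests)) if requests else None

window = [False, False, True, False, True, False, False, True]
print(select_leftmost(window))        # -> 2 (leftmost requesting entry)
print(int(math.log2(len(window))))    # -> 3 arbiter levels for 8 entries
```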
Bypass Logic
- Number of bypass paths depends on pipeline depth (linearly) and issue width (quadratically)
- Composed of operand muxes and buffer drivers
- Delays are quadratically proportional to the length of the result wires, and hence to issue width
- Insignificant compared to other delays as feature size shrinks
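The path-count scaling above can be written out explicitly. One common formulation (the exact constant depends on the design; treat the factor of 2 for the two source operands as an assumption) counts 2 * IW^2 * S paths for issue width IW and S pipeline stages after the first result-producing stage:

```python
# Bypass-path count: quadratic in issue width, linear in the number of
# result-carrying pipeline stages. The factor of 2 (one path per source
# operand) is an assumed convention, not taken from the slides.

def bypass_paths(issue_width, result_stages):
    # Each of IW results may feed either source operand (factor 2)
    # of any of the IW functional units, from any of S stages.
    return 2 * issue_width ** 2 * result_stages

print(bypass_paths(4, 1))  # -> 32
print(bypass_paths(8, 2))  # -> 256
```

Doubling the issue width quadruples the path count, which is why clustering (below) attacks bypass length directly.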
Complexity-Effective Microarchitecture Design Premises
- Retain the benefits of complex issue schemes while enabling a faster clock
- Design assumption: wakeup+select and data bypassing must not be pipelined, as these are atomic operations (dependent instructions should be executable in consecutive cycles)
Dependence-Based Microarchitecture
- Replace the issue window with FIFOs, each queue composed of dependent instructions
- Steer instructions to the appropriate FIFO in the rename stage using heuristics
- 'SRC_FIFO' and 'Reservation Tables' handle dependencies and wakeup
- IPC drops slightly, but the clock rate increases enough to give a faster implementation overall
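The steering heuristic can be sketched as follows. This is a simplified model of the dependence-based idea in [1], with invented function names: an instruction goes behind its producer when that producer sits at a FIFO tail, otherwise into an empty FIFO, otherwise dispatch stalls:

```python
# Simplified sketch of dependence-based FIFO steering (names are
# hypothetical; the real heuristics in [1] have more cases).

def steer(fifos, instr_id, source_ids):
    """fifos: list of lists, each a FIFO with its head at index 0.
    Returns the index of the FIFO used, or None to signal a stall."""
    # Rule 1: append behind a source operand sitting at some FIFO tail,
    # so dependent instructions issue in order from the same queue.
    for i, fifo in enumerate(fifos):
        if fifo and fifo[-1] in source_ids:
            fifo.append(instr_id)
            return i
    # Rule 2: start a new dependence chain in an empty FIFO.
    for i, fifo in enumerate(fifos):
        if not fifo:
            fifo.append(instr_id)
            return i
    # Rule 3: no suitable FIFO this cycle -> stall dispatch.
    return None
```

Because only FIFO heads are candidates for issue, the expensive broadcast across a full issue window is avoided, which is the source of the clock-rate gain.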
Clustering Dependence-Based Microarchitectures
- Reduce bypass delays by reducing the length of the bypass paths
- Minimize inter-cluster communication; otherwise an extra-cycle penalty applies
Clustered Microarchitecture Types
- Single window, execution-driven steering
- Two windows, dispatch-driven steering (best performer)
- Two windows, random steering
Pipelining Dynamic Instruction Scheduling Logic
- Wakeup+select was held atomic in the previous implementation
- Increase performance by pipelining it, while retaining back-to-back execution of dependent instructions
- Speculate on the wakeup by predicting based on both parent and grandparent instructions
- Integrated into the Tomasulo approach
Wakeup Logic Details
- Tags are broadcast as soon as an instruction begins execution
- The broadcast-to-execution-completion latency is specified as shown
- The match bit acts as a sticky bit that enables a delay countdown
- Need not always be correct, due to unexpected stalls
- Select logic remains as in the previous work
Pipelining Rename Logic
- A child instruction assumes its parent will broadcast its tag in the next cycle IF the grandparent instruction broadcasts its tag
- Speculative wakeup on receiving the grandparent's tag, for selection in the next cycle
- Speculative because the parent's selection for execution is not guaranteed
- Requires modifications to the rename map and the dependency-analysis logic
Wakeup and Select Logic
- A wakeup request is sent after inspecting the ready bits of the parents' and grandparents' tags
- A multi-cycle parent's field can be ignored
- In addition to the speculative readiness signified by the request line, a confirm line is activated when all parents are ready
- False selections involve non-confirmed requests
- Problematic only when truly ready instructions are not selected
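The request/confirm distinction above reduces to two conditions, sketched below. This is a behavioral model with invented names, not the gate-level logic of [2]: the request line fires speculatively when every parent is either ready or has ready grandparents (so it should broadcast next cycle), while confirm fires only when every parent is actually ready:

```python
# Behavioral sketch of speculative wakeup per the request/confirm idea.
# Field names ('ready', 'grandparents_ready') are hypothetical.

def wakeup_lines(parents):
    """parents: list of dicts describing each source operand's producer."""
    # Speculative: a parent that is not ready yet should still broadcast
    # next cycle if its own parents (the grandparents) are ready.
    request = all(p['ready'] or p['grandparents_ready'] for p in parents)
    # Confirmed: every parent has actually broadcast its tag.
    confirm = all(p['ready'] for p in parents)
    return request, confirm

# Parent not yet ready, but grandparents are: speculative request fires,
# confirm stays low -- a candidate for false selection.
print(wakeup_lines([{'ready': False, 'grandparents_ready': True}]))
```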
Implementation & Experimentation Details
- Cycle-accurate, execution-driven simulator for the Alpha ISA
- Baseline: conventional 2-cycle scheduling pipeline; Budget / Deluxe: speculative-wakeup scheduling; Ideal: 1-cycle scheduling pipeline
- Factors such as issue width and reservation-station depth considered
- Significant reduction in the critical path with minor IPC impact
- Enables higher clock frequencies, deeper pipelines, and larger instruction windows for better performance
Paradigm Shift
- So far we have added hardware to improve performance
- However, the issue window can also be improved by removing hardware
Current Situation of Issue Windows
- Content-addressable memory (CAM) latency dominates instruction-window latency
- Load capacitance of the CAM is a major limiting factor for speed
- Parasitic capacitance also wastes power
- Issue logic uses a large share of the power budget: 16% for the Pentium Pro, 18% for the Alpha 21264
Unnecessary Circuitry
- Observation: reservation stations compare broadcast tags against both operands; often this is unnecessary
- Only 25% to 35% of architectural instructions have two operands
- Simulation of SPEC2000 programs shows only 10% to 20% of instructions need two comparators at runtime
Simulation
- Used SimpleScalar
- Varied instruction-window size: 16, 64, 256
- Load/store queue of half the window size
Removing Extra Comparators
- Specialize the reservation stations: the number of comparators per station varies from 2 down to 0
- Stall if no station with the minimum required number of comparators is available
- Remove further comparisons by speculating on the last operand to complete
- Needs a predictor; mispredictions carry a penalty
Predictor
- The paper discusses a GSHARE predictor
- It is based on a branch predictor not seen in class
- The idea starts by noting that good indexes for selecting binary predictors are the branch address and the global history
- Thus, if both are good, XORing them together should produce an index embodying more information than either alone
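The XOR indexing idea (from McFarling's gshare scheme [4]) is compact enough to show directly. The table size and the two-bit word-offset shift are illustrative assumptions, not values from the paper:

```python
# Sketch of gshare-style index formation: XOR low PC bits with the
# global history so the table index reflects both sources of
# information. TABLE_BITS is a hypothetical predictor-table size.

TABLE_BITS = 12  # 2^12 predictor entries (assumed for illustration)

def gshare_index(pc, global_history):
    mask = (1 << TABLE_BITS) - 1
    # Drop the low two (word-offset) bits of the PC before hashing.
    return ((pc >> 2) ^ global_history) & mask

# Same branch address, different histories -> different table entries:
print(gshare_index(0x40001000, 0b0000))  # -> 1024
print(gshare_index(0x40001000, 0b1010))  # -> 1034
```

Because the two inputs are folded into one index, a single table distinguishes cases that either input alone would alias together.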
Misprediction
- The Alpha has a scoreboard of valid registers called RDY
- Check whether all operands are available in the register-read stage; if not, flush the pipeline in the same fashion as a latency misprediction
- RDY must be expanded so that its number of read ports matches the issue width
IPC Losses
- Reservation stations with two comparators can be exhausted, causing stalls for SPEC2000 benchmarks such as SWIM
- Adding last-tag prediction improves SWIM's performance but causes 1-3% losses for benchmarks such as Crafty and GCC due to mispredictions
Simulation
- Format shown is the number of two-tag / one-tag / zero-tag reservation stations
- The last-tag predictor is used only in configurations with no two-tag reservation stations
Benefits of Comparator Removal
- In most cases the clock rate can be 25-45% faster, since the tag bus no longer must reach all reservation stations
- Removing comparators removes load capacitance
- Energy saved from the capacitance removal is 30-60%
- Power savings do not track the energy savings, since the clock rate can now increase
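The energy claim follows from the usual dynamic-switching relation, energy per charge/discharge cycle proportional to C*V^2: removing comparator load directly shrinks C. The capacitance and voltage values below are illustrative only; the savings ratio is independent of the proportionality constant:

```python
# Dynamic switching energy scales as C * V^2, so removing comparator
# load capacitance from the tag bus saves energy proportionally.
# All numbers are illustrative, not measurements from the paper.

def dynamic_energy(c_farads, v_volts):
    # Energy per full charge/discharge cycle of a capacitive node.
    return c_farads * v_volts ** 2

baseline = dynamic_energy(2.0e-12, 1.5)  # tag bus with full comparator load
reduced = dynamic_energy(0.9e-12, 1.5)   # after removing unneeded comparators
savings = 1 - reduced / baseline
print(f"{savings:.0%}")  # -> 55%, within the reported 30-60% range
```

The last bullet above also falls out of this relation: if the freed timing slack is spent on a higher clock frequency, power (energy per unit time) drops less than energy per operation does.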
References
1. Complexity-Effective Superscalar Processors – Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith
2. On Pipelining Dynamic Instruction Scheduling Logic – J. Stark, M. D. Brown, and Yale N. Patt
3. Efficient Dynamic Scheduling Through Tag Elimination – Dan Ernst and Todd Austin
4. Combining Branch Predictors – Scott McFarling