Superscalar Architectures
Jason Moore and Habib Ammari
March 25th, 2004
CSE 8383: Advanced Computer Architecture
Instructor: Prof. Hesham El-Rewini
Outline
• Introduction and Motivations
• Overview of Superscalar Architectures
• Asynchronous Superscalar Architecture Design
• Fault Tolerant Superscalar Design
• Comparison Study of Superscalar Architectures
• Summary and Conclusions
Introduction and Review
• Superscalar: a multiple-issue processor that issues a varying number of instructions, whether statically or dynamically scheduled
• Hazard detection is done in hardware
• Execution
- static scheduling: in-order execution
- dynamic scheduling: out-of-order execution
Asynchronous Superscalar Design
• Why?
- In a clocked design, every result has to wait for the time of the longest stage
- How often is the critical path taken?
• Issues with Asynchronous Design
- Increased probability of timing faults
- Loss of the predictability that a clock provides
• Instructions are sent to execution units
• Depends on the compiler to group dependent instructions together
• Instruction compounding can also be dynamic, by using a look-up table
Data Forwarding
• A data forwarding request is sent to the next instruction in the compound
• The request gains access to the write port
• Once the request is received, one of two cases applies
- All other operands are available: data forwarding can occur
- The instruction is waiting on other operands: data forwarding cannot occur
• An acknowledgement or cancellation signal is sent to the forwarding unit
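The request/acknowledge exchange described for data forwarding can be sketched as follows. This is only an illustration under stated assumptions: the `Consumer` class, its field names, and the `"ack"`/`"cancel"` strings are hypothetical and do not come from the asynchronous design being presented.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """Hypothetical model of the next instruction in a compound."""
    needed: set                                   # names of all source operands
    available: set = field(default_factory=set)   # operands already present

    def handle_forward_request(self, operand: str) -> str:
        # Forwarding can occur only if every *other* operand is available.
        others = self.needed - {operand}
        if others <= self.available:
            self.available.add(operand)
            return "ack"      # acknowledgement sent to the forwarding unit
        return "cancel"       # still waiting on other operands


if __name__ == "__main__":
    ready = Consumer(needed={"r1", "r2"}, available={"r2"})
    waiting = Consumer(needed={"r1", "r2"})
    print(ready.handle_forward_request("r1"))
    print(waiting.handle_forward_request("r1"))
```

What the forwarding unit does after a cancellation is not stated on the slide, so the sketch stops at returning the signal.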
Performance Results
[Figure: instructions per cycle for the frep, compress, and queens benchmarks; series: Queued, Statically Compound, Dynamically Compound, Synchronous Superscalar]
Good Idea?
• What we have gained
- Slight speedup
- No clock distribution problem
• The cost
- More circuitry
- Longer design times
- Increased probability of timing faults
Fault Tolerant Superscalar Design
• Why?
- As chip feature sizes shrink toward 0.1 micron, transient faults will increase
- Asynchronous designs such as the one discussed earlier are prone to such faults
- Fault tolerance can be added to the superscalar design at low cost
Types of Fault Tolerant Techniques
• Error detection/error correction
• Duplication of the system
• Re-executing each program
Needed Changes
• The reorder buffer (ROB) is modified so that all statements are executed twice
• This change does not require any more entries in the ROB
• No additional functional units are required
• What if the 2 results do not match?
Basic Idea of How This Works
• An extra bit can be added to each ROB entry to indicate whether the statement is waiting to be run for the first or the second time
• Existing functional units can be used without significant slowdown, since utilization is below 100% due to hazards
• The 2 results are compared; if they agree, the statement simply follows the regular superscalar commit algorithm
• Two options if the 2 results do not match
- Run the statement a third time and take the result on which 2 of the 3 runs agree
- Re-issue the statement
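The execute-twice-and-compare scheme and its two recovery options can be sketched in a few lines. This is a minimal sketch, not the O3RS implementation: the function names, the `max_reissues` limit, and the fault-injecting `make_alu` helper are all hypothetical, and the ROB bit that tracks first versus second execution is not modeled.

```python
from collections import Counter


def make_alu(faulty_calls=()):
    """Toy ALU; call numbers listed in faulty_calls return a corrupted result."""
    calls = {"n": 0}

    def alu(op, operands):
        calls["n"] += 1
        a, b = operands
        result = a + b if op == "add" else a * b
        return result + 1 if calls["n"] in faulty_calls else result

    return alu


def execute_reissue(op, operands, alu, max_reissues=3):
    """Option 2: run twice, commit on agreement, otherwise re-issue."""
    for _ in range(max_reissues):
        first = alu(op, operands)
        second = alu(op, operands)
        if first == second:
            return first
    raise RuntimeError("results never agreed")


def execute_voted(op, operands, alu):
    """Option 1: on a mismatch run a third time and take the 2-of-3 result."""
    first, second = alu(op, operands), alu(op, operands)
    if first == second:
        return first
    third = alu(op, operands)
    value, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no two results agree")
    return value


if __name__ == "__main__":
    print(execute_voted("add", (2, 3), make_alu(faulty_calls={2})))
    print(execute_reissue("add", (2, 3), make_alu(faulty_calls={1})))
```

Voting costs one extra execution on a mismatch, while re-issuing costs two, which is one way to read the trade-off between the two options.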
Performance
• If the number of ALUs is >= 2, then there is significant performance lost by the new system
[Figure: Performance of O3RS; % utilization versus number of ALUs (1, 2, 4, 8)]
Study of a Superscalar Architecture: Ultrascalar
Problem: circuits with quadratic delay $\Theta(n^2)$, where n is the issue width
Study of a superscalar processor architecture: Ultrascalar
Performance study: VLSI complexities (gate delays, area)
Superscalar architecture implementation
• Execution stations (ALUs and controllers)
• Parallel-prefix tree circuits (interconnection network)
• An interleaved cache connected to the execution stations
• Mechanism to communicate register values between execution stations
Main goal
• Design circuits to have at most linear delays
Performance metrics: three parameters
• L (number of logical registers): registers seen by the programmer (defined by the ISA); fewer than the real registers employed by the processor implementor
• n (issue width of the processor): number of instructions executed per clock cycle
• M: bandwidth provided to memory
Memory bandwidth: $M(n) = O(n)$
Ultrascalar Design
• The entire logical register file, with ready bits, is passed to every outstanding instruction
• The datapath of an Ultrascalar processor has 8 execution stations, each responsible for decoding and executing instructions using the data in its register file
• Execution station (ES) "classification": the oldest ES and all younger ones to its right (t: time)
Instruction Sequence    Execution Station
R3 = R1 / R2            ES6
R0 = R0 + R3            ES7
R1 = R5 + R6            ES0
R1 = R0 + R1            ES1
R2 = R5 * R6            ES2
R2 = R2 + R4            ES3
R0 = R5 - R6            ES4
R4 = R0 + R7            ES5
• Communication between ESs through rings of MUXes
• L rings of MUXes (one for each logical register defined by the ISA)
• Each MUX carries a (logical register value, ready bit) pair to successive ESs, and a new pair is inserted at every update of the ring's register
• Ready bit: indicates whether an instruction has computed the register's value
• The registers of each ring are initialized by the oldest ES
• An ES becomes the oldest one on the next clock cycle if it is holding the oldest unfinished instruction
• ES internal structure: ALU, register file, instruction decode logic, and control logic
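The way (value, ready bit) pairs travel from the oldest ES to younger ones can be sketched with the eight-instruction sequence from the slides. This is a minimal sketch under loud assumptions: every operation takes one clock cycle, a result becomes visible to younger stations on the cycle after it is computed, and `run_ring`, `PROGRAM`, and the tuple layout are illustrative names rather than parts of the Ultrascalar design.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# (dest, src1, op, src2), oldest first: ES6, ES7, ES0, ES1, ES2, ES3, ES4, ES5.
PROGRAM = [
    ("R3", "R1", "/", "R2"), ("R0", "R0", "+", "R3"),
    ("R1", "R5", "+", "R6"), ("R1", "R0", "+", "R1"),
    ("R2", "R5", "*", "R6"), ("R2", "R2", "+", "R4"),
    ("R0", "R5", "-", "R6"), ("R4", "R0", "+", "R7"),
]


def run_ring(program, initial_regs):
    """Return (final registers, cycle in which each instruction executed)."""
    results = [None] * len(program)      # None = not computed yet
    done_cycle = [None] * len(program)
    cycle = 0
    while any(r is None for r in results):
        cycle += 1
        # The oldest station initializes every ring with a ready value.
        regs = {name: (value, True) for name, value in initial_regs.items()}
        newly = []
        for i, (dest, a, op, b) in enumerate(program):   # walk the ring
            if results[i] is None:
                (va, ready_a), (vb, ready_b) = regs[a], regs[b]
                if ready_a and ready_b:
                    newly.append((i, OPS[op](va, vb)))
                regs[dest] = (None, False)   # younger stations must wait
            else:
                regs[dest] = (results[i], True)
        for i, value in newly:               # visible from the next cycle
            results[i] = value
            done_cycle[i] = cycle
    final = dict(initial_regs)
    for (dest, _, _, _), value in zip(program, results):
        final[dest] = value                  # the youngest writer wins
    return final, done_cycle


if __name__ == "__main__":
    init = {"R0": 1, "R1": 8, "R2": 2, "R3": 0,
            "R4": 3, "R5": 5, "R6": 4, "R7": 7}
    print(run_ring(PROGRAM, init))
```

Under the one-cycle assumption, the four instructions that read only initial register values all execute in the first cycle even though they are spread across the ring, which is the out-of-order behavior the ready bits make possible.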
• Scalability issue: a newly computed value is propagated through the entire ring of multiplexers in one clock cycle
• Number of multiplexers in the ring = number of outstanding instructions (n), so the gate delay is linear: $O(n)$
Goal: reduce the clock cycle
• Replace each ring of multiplexers with a cyclic, segmented, parallel prefix (CSPP) circuit
• The circuit's gate delay is logarithmic (tree structure): $O(\log_2 n)$
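The step from a linear ring to a logarithmic-depth prefix circuit can be illustrated in software for a single logical register. The sketch below is not the CSPP circuit itself: it uses a Hillis-Steele style scan rather than a tree layout, handles the cyclic part by rotating the stations so the oldest comes first, and all function names are hypothetical. The only property it is meant to show is that an associative "latest writer wins" operator lets the same values be computed in about log2 n combining steps instead of n - 1.

```python
def combine(left, right):
    """Associative operator: the right element wins if it wrote the register."""
    left_value, left_wrote = left
    right_value, right_wrote = right
    return (right_value, True) if right_wrote else (left_value, left_wrote)


def ring_scan(items):
    """Linear reference: n - 1 sequential steps, like the MUX ring."""
    out = [items[0]]
    for item in items[1:]:
        out.append(combine(out[-1], item))
    return out


def prefix_scan(items):
    """Inclusive scan in ceil(log2 n) parallel steps; returns (result, steps)."""
    cur = list(items)
    stride, steps = 1, 0
    while stride < len(cur):
        cur = [cur[i] if i < stride else combine(cur[i - stride], cur[i])
               for i in range(len(cur))]
        stride *= 2
        steps += 1
    return cur, steps


def incoming_values(outputs, oldest):
    """Register value seen by each younger station, keyed by station index."""
    n = len(outputs)
    order = [(oldest + k) % n for k in range(n)]    # rotate: oldest first
    scanned, steps = prefix_scan([outputs[i] for i in order])
    seen = {order[k]: scanned[k - 1][0] for k in range(1, n)}
    return seen, steps


if __name__ == "__main__":
    outputs = [(None, False)] * 8
    outputs[6] = (10, True)    # oldest station inserts the committed value
    outputs[7] = (50, True)    # a younger writer of the same register
    outputs[4] = (40, True)    # a still younger writer
    print(incoming_values(outputs, oldest=6))
```

The oldest station must always be marked as a writer, matching the slide's statement that it initializes each ring.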
Analysis of Ultrascalar
• Objective: determine the Ultrascalar datapath's wire delay and area
• Wire delay and area = f(VLSI layout)
• 2-dimensional VLSI layout: 16 ESs connected together and to memory
• 2 types of nodes: P (propagates the value of one logical register) and M (routes a number of memory accesses)
• X(n): the side length of an n-station Ultrascalar layout
$X(n) = 2X(n/4)$ + width of the wires to connect the four $n/4$-wide Ultrascalars
• $\Theta(L)$ wires to connect the registers
• $\Theta(M(n))$ wires to provide memory bandwidth out of a subtree of n ESs
• A 1-station-wide Ultrascalar has width $\Theta(L)$
$$X(n) = \begin{cases} \Theta(L) + \Theta(M(n)) + 2X(n/4) & \text{if } n > 1 \\ \Theta(L) + \Theta(M(1)) = \Theta(L) & \text{otherwise} \end{cases}$$
$$X(n) = \begin{cases} \Theta(n^{1/2} L) & \text{if } M(n) = O(n^{1/2-\varepsilon}) \text{ for } \varepsilon > 0 \quad (1) \\ \Theta(n^{1/2} (L + \log n)) & \text{if } M(n) = \Theta(n^{1/2}) \quad (2) \\ \Theta(n^{1/2} L + M(n)) & \text{if } M(n) = \Omega(n^{1/2+\varepsilon}) \text{ for } \varepsilon > 0 \quad (3) \end{cases}$$
• If L is constant, cases 1 and 3 are optimal and case 2 is near optimal (optimal to within a factor of log n)
• Case 1: n ESs require a chip that is $\Omega(n^{1/2})$ on a side
• Case 2: similar to case 1
• Case 3: external memory bandwidth of M(n) requires a side length of $\Omega(M(n))$
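The side-length recurrence can be checked numerically. The sketch below assumes every hidden asymptotic constant equals 1 and that n is a power of 4; `side_length` and the sample bandwidth functions are illustrative choices, not values from the Ultrascalar analysis.

```python
import math


def side_length(n, L, M):
    """X(n) = L + M(n) + 2 * X(n / 4) for n > 1, and X(1) = L."""
    if n <= 1:
        return L
    return L + M(n) + 2 * side_length(n // 4, L, M)


if __name__ == "__main__":
    L = 32
    for k in range(1, 8):
        n = 4 ** k
        low_bw = side_length(n, L, lambda m: m ** 0.25)      # case 1 regime
        mid_bw = side_length(n, L, lambda m: math.isqrt(m))  # case 2 regime
        print(n,
              low_bw / (math.sqrt(n) * L),
              mid_bw / (math.sqrt(n) * (L + math.log2(n))))
```

If the closed forms hold, both printed ratios should stay bounded as n grows, which is what the case 1 and case 2 bounds predict.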
Summary and Conclusions
• Asynchronous superscalar design makes slight speed gains at the added cost of more circuitry, longer design time, and more timing faults
• The superscalar architecture can inexpensively be modified to be fault tolerant
• Study of a superscalar architecture: Ultrascalar
• Notion of complexity
• Performance analysis w.r.t. quantitative metrics
Useful Pointers
• D.K. Arvind and R.D. Mullins, "A Fully Asynchronous Superscalar Architecture", in M. Moonen and F. Catthoor, editors, Proc. of the Third Int. Workshop on Algorithms and Parallel VLSI Architectures, pages 203-215, Elsevier Science Publishers, Aug. 1994.
• D. Henry, B. Kuszmaul, and V. Viswanath, "The Ultrascalar Processor – An Asymptotically Scalable Superscalar Microarchitecture", The Twentieth Anniversary Conference on Advanced Research in VLSI (ARVLSI'99), Atlanta, Georgia, March 21-24, 1999.
• B. Kuszmaul, D. Henry, and G. Loh, "A Comparison of Scalable Superscalar Processors", Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, Saint Malo, France, June 1999.
• A. Mendelson and N. Suri, "Designing High-Performance & Reliable Superscalar Architectures: The Out of Order Reliable Superscalar (O3RS) Approach", DSN 2000, pages 473-481, June 2000.
• S. Wallace and N. Bagherzadeh, "Performance Issues of a Superscalar Microprocessor", 23rd International Conference on Parallel Processing, August 1994.