Superscalar Architectures
Jason Moore and Habib Ammari
March 25th, 2004
CSE 8383: Advanced Computer Architecture
Instructor: Prof. Hesham El-Rewini


http://www.engr.smu.edu/

http://www.engr.smu.edu/cse/

Outline

• Introduction and Motivations

• Overview of Superscalar Architectures

• Asynchronous Superscalar Architecture Design

• Fault Tolerant Superscalar Design

• Comparison Study of Superscalar Architectures

• Summary and Conclusions

Introduction and Review

• Superscalar processor: a multiple-issue processor that issues a varying number of instructions per clock cycle, whether statically or dynamically scheduled

• Hazard detection is done in hardware

• Execution
- static scheduling: in-order execution
- dynamic scheduling: out-of-order execution
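The difference between in-order and out-of-order issue can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Instr` record, the 2-wide issue limit, the latencies, and the simplification that only true (read-after-write) dependences matter, as if register renaming had removed the others.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str
    srcs: tuple = ()
    latency: int = 1  # cycles until the result can be used (illustrative)

def producers(program):
    """For each instruction, list the older instructions whose results it reads."""
    last_writer, deps = {}, []
    for i, ins in enumerate(program):
        deps.append([last_writer[s] for s in ins.srcs if s in last_writer])
        last_writer[ins.dest] = i
    return deps

def issue_cycles(program, width=2, in_order=True):
    """Return the cycle in which each instruction issues."""
    deps = producers(program)
    issued_at = [None] * len(program)
    done_at = [None] * len(program)
    cycle = 0
    while None in issued_at:
        slots = width
        for i, ins in enumerate(program):
            if issued_at[i] is not None:
                continue
            ready = all(done_at[p] is not None and done_at[p] <= cycle for p in deps[i])
            if ready and slots > 0:
                issued_at[i] = cycle
                done_at[i] = cycle + ins.latency
                slots -= 1
            elif in_order:
                break  # a stalled older instruction blocks all younger ones
        cycle += 1
    return issued_at

program = [
    Instr("R1", (), latency=3),   # e.g. a slow load
    Instr("R2", ("R1",)),         # must wait for R1
    Instr("R3", ()),              # independent
    Instr("R4", ("R3",)),         # depends only on R3
]
print(issue_cycles(program, in_order=True))   # [0, 3, 3, 4]
print(issue_cycles(program, in_order=False))  # [0, 3, 0, 1]
```

In this toy program the independent instructions issue behind the stalled one under in-order issue, while out-of-order issue lets them go ahead.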


Introduction and Review (cont’d)

Asynchronous Superscalar Design

• Why?
- In a clocked design, every result has to wait for the time of the longest stage
- How often is the critical path actually taken?

• Issues with Asynchronous Design
- Increased probability of timing faults
- Loss of the predictability that a clock provides
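The motivation can be made concrete with a small Python sketch. The delay values are made up for illustration, and the model is an assumption: a clocked design charges every operation the worst-case (critical path) delay, while an asynchronous design charges each operation only the delay it actually needs.

```python
def synchronous_time(actual_delays, worst_case_delay):
    """Every operation is charged one clock period sized for the critical path."""
    return worst_case_delay * len(actual_delays)

def asynchronous_time(actual_delays):
    """Each operation signals completion as soon as it is actually done."""
    return sum(actual_delays)

delays_ns = [2, 2, 5, 2, 3]            # only one operation exercises the 5 ns critical path
print(synchronous_time(delays_ns, 5))  # 25
print(asynchronous_time(delays_ns))    # 14
```

The less often the critical path is taken, the larger the gap between the two totals in this model.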

Ad Hoc Networking

Asynchronous Superscalar Design (cont’d)

• Instructions are sent to Execution Units
• The design depends on the compiler to group dependent instructions together (instruction compounding)
• Instruction compounding can also be done dynamically by using a look-up table
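As a rough illustration of static compounding, the sketch below groups dependent instructions in Python. The grouping rule is a hypothetical one chosen for this example, not the rule used by the architecture discussed here: an instruction joins the current compound if it reads a register written earlier in that compound, and `max_size` is an assumed limit. The dynamic look-up-table variant is not modeled.

```python
def compound(program, max_size=4):
    """Greedily group (dest, srcs) instructions into compounds of dependent instructions."""
    compounds = []
    current, written = [], set()
    for dest, srcs in program:
        depends = any(s in written for s in srcs)
        if current and depends and len(current) < max_size:
            current.append((dest, srcs))          # extend the current compound
        else:
            if current:
                compounds.append(current)
            current, written = [(dest, srcs)], set()  # start a new compound
        written.add(dest)
    if current:
        compounds.append(current)
    return compounds

program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),   # reads r1: joins the first compound
    ("r6", ("r7", "r8")),   # independent: starts a new compound
    ("r9", ("r6", "r4")),   # reads r6: joins the second compound
]
for group in compound(program):
    print([dest for dest, _ in group])
```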

Data Forwarding

• A data forwarding request is sent to the next instruction in the compound
• The request gains access to the write port
• Once the request is received, one of two cases applies
- All other operands are available: data forwarding can occur
- The instruction is still waiting on other operands: data forwarding cannot occur
• An acknowledgement or cancellation signal is sent back to the forwarding unit
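The request/acknowledge/cancel exchange described for data forwarding can be sketched in Python as follows. The class and method names are hypothetical, and the sketch models only the decision made by the receiving instruction, not the write-port arbitration or what the forwarding unit does after a cancellation.

```python
from dataclasses import dataclass, field

@dataclass
class WaitingInstruction:
    needed: set                                    # operand names the instruction reads
    available: dict = field(default_factory=dict)  # operands that have already arrived

    def forward_request(self, operand, value):
        """Handle a forwarding request and answer 'ack' or 'cancel'."""
        others = self.needed - {operand}
        if all(o in self.available for o in others):
            self.available[operand] = value  # all other operands present: forwarding occurs
            return "ack"
        return "cancel"                      # still waiting on other operands

inst = WaitingInstruction(needed={"a", "b"})
print(inst.forward_request("a", 1))  # cancel, because "b" has not arrived
inst.available["b"] = 2
print(inst.forward_request("a", 1))  # ack
```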


Performance Results

[Figure: bar chart of instructions per cycle (axis from 0 to 3.5) for the benchmarks frep, compress, and queens. Series: Queued, Statically Compound, Dynamically Compound, Synchronous Superscalar.]

Good Idea?

• What we have gained
- Slight speedup
- No clock distribution problem

• The Cost
- More circuitry
- Longer design times
- Increased probability of timing faults

Fault Tolerant Superscalar Design

• Why?
- As chip feature sizes shrink toward 0.1 micron, transient faults will increase
- Asynchronous designs such as the one discussed earlier are prone to such faults
- Fault tolerance can be added to the superscalar design at low cost

Types of Fault Tolerant Techniques

• Error detection/error correction
• Duplication of the system
• Re-executing each program

Needed Changes

• The ROB (reorder buffer) is modified so that all statements are executed twice
• This change does not require any more entries in the ROB
• No more functional units are required
• What if the 2 results do not match?

Basic Idea of How This Works

• An extra bit is added to each ROB entry to indicate whether the statement is waiting to be run for the first or the second time
• Existing functional units can be used without significant slowdown, since utilization is below 100% due to hazards
• The 2 results are compared; if they agree, the statement simply follows the regular superscalar commit algorithm
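A minimal Python sketch of the execute-twice-and-compare idea is given below. The `RobEntry` fields and the `execute` function are hypothetical names for this illustration; `second_pass` plays the role of the extra bit, and the `op` callable is a stand-in for whatever the functional unit computes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RobEntry:
    op: Callable[[], int]                # stand-in for the operation sent to a functional unit
    second_pass: bool = False            # extra bit: False = waiting for first run, True = second
    first_result: Optional[int] = None

def execute(entry):
    """Run the entry once and return 'pending', 'commit', or 'mismatch'."""
    result = entry.op()
    if not entry.second_pass:
        entry.first_result = result      # remember the first result, flip the bit
        entry.second_pass = True
        return "pending"
    return "commit" if result == entry.first_result else "mismatch"

entry = RobEntry(op=lambda: 2 + 3)
print(execute(entry))  # pending
print(execute(entry))  # commit

faulty_results = iter([5, 7])            # second run is hit by a simulated transient fault
faulty = RobEntry(op=lambda: next(faulty_results))
print(execute(faulty))  # pending
print(execute(faulty))  # mismatch
```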

Basic Idea of How This Works (cont’d)

• Two options if the results do not match
- Run the statement a third time and take the result that 2 of the 3 runs agree upon
- Re-issue the statement
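The two recovery options can be sketched in Python as shown here. The function names, the `max_attempts` bound, and the choice to return `None` when no agreement is reached are assumptions made for the example.

```python
from collections import Counter

def resolve_by_vote(first, second, run_again):
    """Option 1: run a third time and keep the value that 2 of the 3 runs agree on."""
    third = run_again()
    value, count = Counter([first, second, third]).most_common(1)[0]
    return value if count >= 2 else None   # None: all three runs disagree

def resolve_by_reissue(run_pair, max_attempts=3):
    """Option 2: re-issue the statement (two fresh runs) until both runs agree."""
    for _ in range(max_attempts):
        a, b = run_pair()
        if a == b:
            return a
    return None                            # gave up after max_attempts (assumed limit)

print(resolve_by_vote(5, 7, lambda: 5))    # 5

pairs = iter([(5, 7), (5, 5)])             # first re-issue still disagrees, second agrees
print(resolve_by_reissue(lambda: next(pairs)))  # 5
```

Voting costs one extra execution, while re-issuing costs two but needs no majority logic.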


Performance

• If the number of ALUs is >= 2, then there is significant performance lost by the new system

[Figure: "Performance of O3RS", bar chart of % utilization (axis from 0 to 2.5) against the number of ALUs (1, 2, 4, 8). Legend text: amptjpgcc.]


Problem: circuits with quadratic delay, $\Theta(n^2)$, where n is the issue width

Study of a superscalar processor architecture: Ultrascalar

Performance study: VLSI complexities (gate delays, area)


Study of a Superscalar Architecture: Ultrascalar

Superscalar architecture implementation
• Execution stations (ALUs and controllers)
• Parallel-prefix tree circuits (interconnection network)
• An interleaved cache connected to the execution stations
• Mechanism to communicate register values between execution stations

Main goal
• Design circuits to have at most linear delays


Study of a Superscalar Architecture: Ultrascalar (cont’d)

Performance metrics: three parameters
• L (number of logical registers): registers seen by the programmer (defined by the ISA); fewer than the real registers employed by the processor implementor
• n (issue width of the processor): number of instructions executed per clock cycle
• M: bandwidth provided to memory

Memory bandwidth: $M(n) = O(n)$


Ultrascalar Design
• The entire logical register file, with ready bits, is passed to every outstanding instruction
• The datapath of an Ultrascalar processor has 8 execution stations (responsible for decoding and executing instructions using the data in their register files)
• Execution station (ES) “classification” (oldest ES and all younger ones to its right), t: time



• Instruction sequence and the execution station holding each instruction

Instruction       Execution Station
R3 = R1 / R2      ES6
R0 = R0 + R3      ES7
R1 = R5 + R6      ES0
R1 = R0 + R1      ES1
R2 = R5 * R6      ES2
R2 = R2 + R4      ES3
R0 = R5 - R6      ES4
R4 = R0 + R7      ES5


• Communication between ESs happens through rings of MUXes
• There are L rings of MUXes (one for each logical register defined by the ISA)




• Each MUX carries a (logical register’s value, ready bit) pair to successive ESs; a new pair is inserted whenever an ES updates that ring’s register

• Ready bit: indicates whether an instruction has already computed the register’s value

• The registers of each ring are initialized by the oldest ES

• An ES becomes the oldest one on the next clock cycle if it is holding the oldest unfinished instruction

• ES’s internal structure: ALU, register file, instruction decode logic, and control logic
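To see what the rings deliver, the Python sketch below walks the eight-instruction example in age order and records, for every source register, which station's value reaches each ES. The pairing of instructions to stations follows the example sequence (oldest instruction in ES6). The `"committed"` label for values supplied by the oldest station's register file and the function name are assumptions for this illustration, and ready bits and timing are ignored.

```python
# (destination, sources, station) in program order, oldest first
PROGRAM = [
    ("R3", ("R1", "R2"), 6),
    ("R0", ("R0", "R3"), 7),
    ("R1", ("R5", "R6"), 0),
    ("R1", ("R0", "R1"), 1),
    ("R2", ("R5", "R6"), 2),
    ("R2", ("R2", "R4"), 3),
    ("R0", ("R5", "R6"), 4),
    ("R4", ("R0", "R7"), 5),
]

def operand_sources(program):
    """Map each station to {source register: station whose value arrives on that ring}."""
    carried = {}   # register -> station that most recently inserted a value into its ring
    result = {}
    for dest, srcs, station in program:
        # read incoming ring values first, then insert this station's own result
        result[station] = {s: carried.get(s, "committed") for s in srcs}
        carried[dest] = station
    return result

for station, sources in operand_sources(PROGRAM).items():
    print(f"ES{station}: {sources}")
```

For example, ES1 (R1 = R0 + R1) receives R0 from ES7 and R1 from ES0, while ES5 receives R0 from ES4 rather than from the older write in ES7.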







• Scalability issue: a newly computed value is propagated through the entire ring of multiplexers in one clock cycle

• Number of multiplexers in the ring = number of outstanding instructions (n), so the gate delay is linear: $O(n)$

Goal: reduce the clock cycle
• Replace each ring of multiplexers with a cyclic, segmented, parallel prefix (CSPP) circuit
• Circuit’s gate delay is logarithmic (tree structure): $O(\log_2 n)$
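The step from linear to logarithmic depth can be illustrated in software with a segmented prefix computation. The Python sketch below is an analogue only, not the CSPP circuit itself: it uses a Hillis-Steele style scan rather than a tree, handles the cyclic aspect by listing stations in age order starting from the oldest, and uses made-up value labels. Each item is a (value, inserted) pair for one register ring.

```python
def combine(older, newer):
    """Associative operator: a station that inserted a value overrides what came before."""
    return newer if newer[1] else older

def ring_scan(items):
    """Reference behaviour of the MUX ring: one station after another (linear depth)."""
    out, acc = [], None
    for item in items:
        acc = item if acc is None else combine(acc, item)
        out.append(acc)
    return out

def prefix_scan(items):
    """Same outputs in ceil(log2 n) rounds; all combines within a round are independent."""
    cur, step, rounds = list(items), 1, 0
    while step < len(cur):
        cur = [cur[i] if i < step else combine(cur[i - step], cur[i])
               for i in range(len(cur))]
        step *= 2
        rounds += 1
    return cur, rounds

# Ring for R0 in the 8-station example, oldest station (ES6) first.
r0_ring = [
    ("R0@committed", True),  # ES6: oldest station initializes the ring
    ("R0@ES7", True),        # ES7 writes R0
    (None, False), (None, False), (None, False), (None, False),  # ES0..ES3 pass through
    ("R0@ES4", True),        # ES4 writes R0
    (None, False),           # ES5 passes through
]
scanned, rounds = prefix_scan(r0_ring)
print([value for value, _ in scanned])
print(rounds)  # 3 rounds for n = 8
```

Entry i of the scan is the value leaving station i, which is what the next younger station sees on that ring.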




Analysis of Ultrascalar
• Objective: determine the Ultrascalar datapath’s wire delay and area
• Wire delay and area = f(VLSI layout)
• 2-dimensional VLSI layout: 16 ESs connected together and to memory
• 2 types of nodes: P (propagates the value of one logical register) and M (routes a number of memory accesses)
• X(n): the side length of an n-station Ultrascalar layout






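$X(n) = 2X(n/4)$ + width of the wires to connect the four n/4-wide Ultrascalars

• $\Theta(L)$ wires to connect the registers
• $\Theta(M(n))$ wires to provide memory bandwidth out of a subtree of n ESs
• A 1-station-wide Ultrascalar has width $\Theta(L)$

$$X(n) = \begin{cases} \Theta(L) + \Theta(M(n)) + 2X(n/4) & \text{if } n > 1 \\ \Theta(L) + \Theta(M(1)) = \Theta(L) & \text{otherwise} \end{cases}$$

This recurrence has the solution

$$X(n) = \begin{cases} \Theta(\sqrt{n}\,L) & \text{if } M(n) = O(n^{1/2-\varepsilon}) \text{ for } \varepsilon > 0 \quad (1) \\ \Theta(\sqrt{n}\,(L + \log n)) & \text{if } M(n) = \Theta(n^{1/2}) \quad (2) \\ \Theta(\sqrt{n}\,L + M(n)) & \text{if } M(n) = \Omega(n^{1/2+\varepsilon}) \text{ for } \varepsilon > 0 \quad (3) \end{cases}$$

• L = constant: cases 1 and 3 are optimal; case 2 is near optimal (optimal to within a factor of log n)
• Case 1: n ESs require a chip that is $\Omega(\sqrt{n})$ on a side
• Case 2: similar to case 1
• Case 3: external memory bandwidth of M(n) requires a side length of $\Omega(M(n))$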
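The growth rates can be checked numerically with a short Python sketch. It evaluates the side-length recurrence X(n) = L + M(n) + 2X(n/4) with X(1) = L, under the assumption that every hidden constant equals 1 and that n is a power of 4. The value L = 32 and the two bandwidth functions are arbitrary choices for the example.

```python
import math

def side_length(n, L, M):
    """Evaluate X(n) = L + M(n) + 2*X(n/4), X(1) = L, with all constants set to 1."""
    if n <= 1:
        return L
    return L + M(n) + 2 * side_length(n // 4, L, M)

L = 32
for k in (2, 6, 10):
    n = 4 ** k
    low = side_length(n, L, lambda m: 1)    # small bandwidth, case 1
    high = side_length(n, L, lambda m: m)   # M(n) = n, case 3
    print(n, round(low / (math.sqrt(n) * L), 3), round(high / n, 3))
```

With constant bandwidth the ratio X(n) / (√n · L) settles near a constant, while with M(n) = n the side length tracks M(n) itself, in line with cases 1 and 3.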




Summary and Conclusions

• Asynchronous superscalar design makes slight speed gains at the added costs of more circuitry, longer design time, and more timing faults
• The superscalar architecture can inexpensively be modified to be fault tolerant
• Study of a superscalar architecture: Ultrascalar
• Notion of complexity
• Performance analysis w.r.t. quantitative metrics


Useful Pointers

• D.K. Arvind and R.D. Mullins, "A Fully Asynchronous Superscalar Architecture", in M. Moonen and F. Catthoor, editors, Proc. of the Third Int. Workshop on Algorithms and Parallel VLSI Architectures, pages 203-215, Elsevier Science Publishers, Aug. 1994.

• D. Henry, B. Kuszmaul, and V. Viswanath, "The Ultrascalar Processor – An Asymptotically Scalable Superscalar Microarchitecture", The Twentieth Anniversary Conference on Advanced Research in VLSI (ARVLSI’99), Atlanta, Georgia, March 21-24, 1999.

• B. Kuszmaul, D. Henry, and G. Loh, "A Comparison of Scalable Superscalar Processors", Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, Saint Malo, France, June 1999.

• A. Mendelson and N. Suri, "Designing High-Performance & Reliable Superscalar Architectures: The Out of Order Reliable Superscalar (O3RS) Approach", DSN 2000, pages 473-481, June 2000.

• S. Wallace and N. Bagherzadeh, "Performance Issues of a Superscalar Microprocessor", 23rd International Conference on Parallel Processing, August 1994.
