
Superscalar Architectures
Jason Moore and Habib Ammari
March 25th, 2004
CSE 8383: Advanced Computer Architecture
Instructor: Prof. Hesham El-Rewini



Outline

• Introduction and Motivations
• Overview of Superscalar Architectures
• Asynchronous Superscalar Architecture Design
• Fault Tolerant Superscalar Design
• Comparison Study of Superscalar Architectures
• Summary and Conclusions

Introduction and Review

• A multiple-issue processor that issues varying numbers of instructions per clock cycle, whether statically or dynamically scheduled
• Hazard detection is done in hardware
• Execution
  - static scheduling: in-order execution
  - dynamic scheduling: out-of-order execution


Asynchronous Superscalar Design

• Why?
  - Currently, every result has to wait for the time of the longest stage
  - How often is the critical path taken?
• Issues with Asynchronous Design
  - Increased probability of timing faults
  - Loss of the predictability that a clock provides
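The motivation on the slide above can be illustrated with a small timing comparison. This is only a sketch under strong assumptions: instructions run one after another with no overlap, and the per-stage delays are made-up numbers. The function names `synchronous_time` and `asynchronous_time` are hypothetical and do not come from the presentation.

```python
def synchronous_time(stage_delays):
    """Clocked design: every stage of every instruction takes one clock
    period, and the period must cover the slowest stage ever observed."""
    period = max(max(delays) for delays in stage_delays)
    return sum(len(delays) * period for delays in stage_delays)


def asynchronous_time(stage_delays):
    """Self-timed design: every stage takes only its actual delay."""
    return sum(sum(delays) for delays in stage_delays)


# Hypothetical delays (arbitrary time units) for three instructions with
# three stages each; only the second instruction exercises the critical path.
delays = [[2, 3, 2], [2, 8, 2], [2, 3, 2]]
print(synchronous_time(delays), asynchronous_time(delays))
```

The rarer the critical path, the larger the gap between the two totals, which is the point behind the question "How often is the critical path taken?".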

Asynchronous Superscalar Design (cont’d)

• Instructions are sent to Execution Units
• Depends on the compiler to group together dependent instructions
• Instruction compounding can be dynamic by using a look-up table

Data Forwarding

• Data Forwarding request is sent to the next instruction in the compound
• Request gains access to the write port
• Once the Request is received
  - All other operands are available: Data Forwarding can occur
  - Instruction is waiting on other operands: Data Forwarding cannot occur
• Acknowledgement or cancellation signal is sent to the forwarding unit
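The request/acknowledge exchange described above might be modelled as follows. This is a software sketch, not the hardware protocol of the original design: the `Instruction` class, the `forward_request` function, and the `"ack"`/`"cancel"` strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    name: str
    needed: set                                     # operand names the instruction reads
    available: dict = field(default_factory=dict)   # operands it already holds


def forward_request(consumer, operand, value):
    """Handle a forwarding request for one operand of the next instruction.

    Forwarding is acknowledged only if every other operand is already
    available; otherwise the request is cancelled and nothing is delivered.
    """
    others = consumer.needed - {operand}
    if all(name in consumer.available for name in others):
        consumer.available[operand] = value
        return "ack"
    return "cancel"


ready = Instruction("add", {"r1", "r2"}, {"r2": 7})
waiting = Instruction("mul", {"r1", "r3"})
print(forward_request(ready, "r1", 5))    # other operand r2 is present
print(forward_request(waiting, "r1", 5))  # still waiting on r3
```

What the forwarding unit does with the value after a cancellation is not stated on the slide, so the sketch leaves that out.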

Performance Results

[Figure: bar chart of Instructions Per Cycle (axis from 0 to 3.5) for the benchmarks frep, compress, and queens; series: Queued, Statically Compound, Dynamically Compound, Synchronous Superscalar]

Good Idea?

• What we have gained
  - Slight speedup
  - No clock distribution problem
• The Cost
  - More circuitry
  - Longer design times
  - Increased probability of timing faults

Fault Tolerant Superscalar Design

• Why?
  - As chip feature sizes shrink toward 0.1 micron, transient faults will increase
  - Asynchronous designs such as the one I discussed earlier are prone to such faults
  - Fault tolerance can be added to the superscalar design at low cost

Types of Fault Tolerant Techniques

• Error Detection/Error Correction
• Duplication of the system
• Re-executing each program

Needed Changes

• The reorder buffer (ROB) is modified so that all statements are executed twice
• This change will not require any more entries in the ROB
• No more functional units will be required
• What if the 2 results do not match?

Basic Idea of How This Works

• An extra bit can be added to each ROB entry to indicate whether the statement is waiting to be run for the first time or the second time
• Existing functional units can be used without significant slowdown, since utilization is not at 100% due to hazards
• The 2 results are compared; if they agree, the statement simply follows the regular superscalar commit algorithm

Basic Idea of How This Works (cont’d)

• Two options if the 2 results do not match
  - Run the statement a third time and take the result on which 2 of the 3 runs agree
  - Re-issue the statement
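The double-execution scheme from the last three slides can be sketched in software. The code below is an illustration under assumptions, not the O3RS implementation: `RobEntry`, `step`, `make_flaky_alu`, and `run_to_commit` are hypothetical names, the "ALU" is an ordinary Python function, and only the re-issue option is shown (the 2-of-3 vote would replace the reset in `step`).

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RobEntry:
    op: Callable[[int, int], int]
    a: int
    b: int
    second_pass: bool = False           # the extra bit: False = waiting for first run
    first_result: Optional[int] = None  # result kept from the first run


def step(entry, alu):
    """Run the entry once. Return the value to commit, or None if the
    entry still needs another run."""
    result = alu(entry.op, entry.a, entry.b)
    if not entry.second_pass:
        entry.first_result = result
        entry.second_pass = True
        return None
    if result == entry.first_result:
        return result                   # both runs agree: normal commit
    entry.second_pass = False           # mismatch: re-issue from scratch
    entry.first_result = None
    return None


def make_flaky_alu(bad_calls):
    """Hypothetical ALU that flips the low bit on the listed call numbers."""
    calls = {"n": 0}

    def alu(op, a, b):
        calls["n"] += 1
        value = op(a, b)
        return value ^ 1 if calls["n"] in bad_calls else value

    return alu, calls


def run_to_commit(entry, alu, max_runs=10):
    for _ in range(max_runs):
        committed = step(entry, alu)
        if committed is not None:
            return committed
    raise RuntimeError("entry did not commit")


alu, calls = make_flaky_alu({2})        # transient fault on the second run
print(run_to_commit(RobEntry(operator.add, 3, 4), alu), calls["n"])
```

With a single transient fault on the second run, the entry is re-issued and commits after two further matching runs, so the fault costs two extra ALU slots rather than a wrong result.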

Performance

• If the number of ALUs is >= 2, then there is significant performance lost by the new system

[Figure: Performance of O3RS; y-axis: % Utilization (0 to 2.5); x-axis: Number of ALUs (1, 2, 4, 8); legend: amptjpgcc]

Study of a Superscalar Architecture: Ultrascalar

• Problem: circuits with quadratic delay $\Theta(n^2)$, where n is the issue width
• Study of a superscalar processor architecture: Ultrascalar
• Performance study: VLSI complexities (gate delays, area)

Study of a Superscalar Architecture: Ultrascalar (cont’d)

Superscalar architecture implementation
• Execution stations (ALUs and controllers)
• Parallel-prefix tree circuits (interconnection network)
• An interleaved cache connected to the execution stations
• Mechanism to communicate register values between execution stations

Main goal
• Design circuits to have at most linear delays

Study of a Superscalar Architecture: Ultrascalar (cont’d)

Performance metrics: three parameters
• L (number of logical registers): the registers seen by the programmer (defined by the ISA), fewer than the real registers employed by the processor implementor
• n (issue width of the processor): number of instructions executed per clock cycle
• M: bandwidth provided to memory

Memory bandwidth: $M(n) = O(n)$

Study of a Superscalar Architecture: Ultrascalar (cont’d)

Ultrascalar Design
• Passing of the entire logical register file with ready bits to every outstanding instruction
• The datapath of an Ultrascalar processor has 8 execution stations (responsible for decoding and executing instructions using the data in their register files)
• Execution station (ES) “classification” (oldest ES and all younger ones to its right), t: time

Study of a Superscalar Architecture: Ultrascalar (cont’d)

Instruction Sequence    Execution Station
R3 = R1 / R2            ES6
R0 = R0 + R3            ES7
R1 = R5 + R6            ES0
R1 = R0 + R1            ES1
R2 = R5 * R6            ES2
R2 = R2 + R4            ES3
R0 = R5 - R6            ES4
R4 = R0 + R7            ES5

• Communication between ESs through rings of MUXes
• L rings of MUXes (one for each logical register defined by the ISA)
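To make the role of the rings concrete, the sketch below works out, for the instruction sequence on this slide, which execution station supplies each source operand. It assumes that an operand comes from the nearest older station that writes the register and otherwise from the register state held by the oldest station (labelled `"initial"` here). The function name `operand_sources` and that label are hypothetical.

```python
# (execution station, destination register, source registers), oldest first
program = [
    ("ES6", "R3", ("R1", "R2")),
    ("ES7", "R0", ("R0", "R3")),
    ("ES0", "R1", ("R5", "R6")),
    ("ES1", "R1", ("R0", "R1")),
    ("ES2", "R2", ("R5", "R6")),
    ("ES3", "R2", ("R2", "R4")),
    ("ES4", "R0", ("R5", "R6")),
    ("ES5", "R4", ("R0", "R7")),
]


def operand_sources(program):
    """Map each station to the producer of each of its source registers."""
    last_writer = {}   # register -> most recent older station that writes it
    sources = {}
    for station, dest, srcs in program:
        sources[station] = {reg: last_writer.get(reg, "initial") for reg in srcs}
        last_writer[dest] = station
    return sources


for station, producers in operand_sources(program).items():
    print(station, producers)
```

For example, ES1 reads R0 from ES7 and R1 from ES0, while ES5 reads R0 from ES4 rather than from ES7, because ES4 is the nearest older writer of R0.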


Study of a Superscalar Architecture: Ultrascalar (cont’d)

• A (logical register’s value, ready bit) pair is carried by a MUX to successive ESs, and a new pair is inserted at every update of the ring’s register
• Ready bit: indicates whether an instruction has computed the register’s value
• Initialization of the registers of each ring is done by the oldest ES
• An ES becomes the oldest one on the next clock cycle if it is holding the oldest unfinished instruction
• ES’s internal structure: ALU, register file, instruction decode logic, and control logic


Study of a Superscalar Architecture: Ultrascalar (cont’d)

• Scalability issue: a newly computed value is propagated through the entire ring of multiplexers in one clock cycle
• Number of multiplexers in the ring = number of outstanding instructions (n), so the gate delay is linear: $O(n)$

Goal: reduce the clock cycle
• Replace each ring of multiplexers with a cyclic, segmented, parallel prefix (CSPP) circuit
• Circuit’s gate delay is logarithmic (tree structure): $O(\log_2 n)$
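The difference between the ring and the prefix circuit can be sketched in software. The code below is an analogy, not the CSPP hardware: the cyclic aspect is modelled by listing the stations with the oldest first, the oldest station is assumed to always drive the register, and the prefix is computed by recursive doubling rather than by an explicit tree. All function names are hypothetical.

```python
def ring_incoming(stations):
    """Walk the ring mux by mux: n - 1 sequential hops.

    stations: list of (drives, value) pairs, oldest station first; the
    oldest station must drive. Returns the value arriving at each station
    (None for the oldest, which ignores its input).
    """
    incoming = [None]
    current = stations[0][1]
    for drives, value in stations[1:]:
        incoming.append(current)
        if drives:
            current = value
    return incoming


def combine(left, right):
    """Segmented 'pass-through' operator: the nearer driver wins."""
    return right if right[0] else left


def prefix_incoming(stations):
    """Same result via an inclusive parallel prefix in ceil(log2 n) levels."""
    n = len(stations)
    scan = list(stations)
    stride, levels = 1, 0
    while stride < n:
        scan = [scan[i] if i < stride else combine(scan[i - stride], scan[i])
                for i in range(n)]
        stride *= 2
        levels += 1
    return [None] + [scan[i - 1][1] for i in range(1, n)], levels


# Ring for register R0 in the 8-station example: ES6 (oldest) holds the
# committed value 10; ES7 and ES4 write R0 (values 20 and 30 are made up).
r0_ring = [(True, 10), (True, 20), (False, None), (False, None),
           (False, None), (False, None), (True, 30), (False, None)]
print(ring_incoming(r0_ring))
print(prefix_incoming(r0_ring))
```

Both functions deliver the same values to every station; the ring needs a chain of n - 1 multiplexers in sequence, while the doubling scheme needs only 3 levels for n = 8.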


Study of a Superscalar Architecture: Ultrascalar (cont’d)

Analysis of Ultrascalar
• Objective: determine the Ultrascalar datapath’s wire delay and area
• Wire delay and area = f(VLSI layout)
• 2-dimensional VLSI layout: 16 ESs connected together and to memory
• 2 types of nodes: P (propagates the value of one logical register) and M (routes a number of memory accesses)
• X(n): the side length of an n-station Ultrascalar layout


Study of a Superscalar Architecture: Ultrascalar (cont’d)

X(n) = 2X(n/4) + width of the wires to connect the four n/4-wide Ultrascalars

• $\Theta(L)$ wires to connect the registers
• $\Theta(M(n))$ wires to provide memory bandwidth out of a subtree of n ESs
• A 1-station-wide Ultrascalar has width $\Theta(L)$

$$X(n) = \begin{cases} \Theta(L) + \Theta(M(n)) + 2X(n/4) & \text{if } n > 1 \\ \Theta(L) + \Theta(M(1)) = \Theta(L) & \text{otherwise} \end{cases}$$

Study of a Superscalar Architecture: Ultrascalar (cont’d)

$$X(n) = \begin{cases} \Theta(n^{1/2} L) & \text{if } M(n) = O(n^{1/2-\varepsilon}) \text{ for } \varepsilon > 0 \quad (1) \\ \Theta(n^{1/2}(L + \log n)) & \text{if } M(n) = \Theta(n^{1/2}) \quad (2) \\ \Theta(n^{1/2} L + M(n)) & \text{if } M(n) = \Omega(n^{1/2+\varepsilon}) \text{ for } \varepsilon > 0 \quad (3) \end{cases}$$

• L = constant: cases 1 and 3 are optimal; case 2 is near optimal (optimal to within a factor of log n)
• Case 1: n ESs require a chip that is $\Omega(n^{1/2})$ on a side
• Case 2: similar to case 1
• Case 3: external memory bandwidth of M(n) requires a side length of $\Omega(M(n))$
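The recurrence for X(n) can be evaluated numerically to see the cases emerge. This is a sketch with every hidden asymptotic constant taken as 1 and X(1) = L, which are assumptions made for illustration. The name `side_length` and the three bandwidth functions are hypothetical.

```python
import math


def side_length(n, L, M):
    """Evaluate X(n) = L + M(n) + 2 X(n/4) with X(1) = L.

    n must be a power of 4; L is the number of logical registers and
    M is a function giving the memory bandwidth for n stations.
    """
    if n == 1:
        return L
    return L + M(n) + 2 * side_length(n // 4, L, M)


L = 32
bandwidths = {
    "no memory wires": lambda n: 0,             # falls under case 1
    "M(n) = sqrt(n)": lambda n: math.isqrt(n),  # case 2
    "M(n) = n": lambda n: n,                    # case 3
}
for label, M in bandwidths.items():
    for n in (4 ** 3, 4 ** 6):
        x = side_length(n, L, M)
        print(f"{label:16s} n={n:5d} X(n)={x:7d} X(n)/sqrt(n)={x / math.sqrt(n):.1f}")
```

With no memory wires the recurrence gives exactly L(2√n - 1); with M(n) = √n each of the log₄ n levels adds another √n; with M(n) = n the memory term is at least as large as the top-level M(n) and dominates once n is large.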

Summary and Conclusions

• Asynchronous superscalar design makes slight speed gains at the added costs of more circuitry, longer design time, and more timing faults
• The superscalar architecture can inexpensively be modified to be fault tolerant
• Study of a superscalar architecture: Ultrascalar
• Notion of complexity
• Performance analysis w.r.t. quantitative metrics

Useful Pointers

• D. K. Arvind and R. D. Mullins, “A Fully Asynchronous Superscalar Architecture”, in M. Moonen and F. Catthoor (eds.), Proc. of the Third Int. Workshop on Algorithms and Parallel VLSI Architectures, pages 203-215, Elsevier Science Publishers, Aug. 1994.
• D. Henry, B. Kuszmaul, and V. Viswanath, “The Ultrascalar Processor – An Asymptotically Scalable Superscalar Microarchitecture”, The Twentieth Anniversary Conference on Advanced Research in VLSI (ARVLSI’99), Atlanta, Georgia, March 21-24, 1999.
• B. Kuszmaul, D. Henry, and G. Loh, “A Comparison of Scalable Superscalar Processors”, Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, Saint Malo, France, June 1999.
• A. Mendelson and N. Suri, “Designing High-Performance & Reliable Superscalar Architectures: The Out of Order Reliable Superscalar (O3RS) Approach”, DSN 2000, pages 473-481, June 2000.
• S. Wallace and N. Bagherzadeh, “Performance Issues of a Superscalar Microprocessor”, 23rd International Conference on Parallel Processing, August 1994.