CS 152: Computer Architecture and Engineering
Lecture 15: Advanced CPUs (Superscalars and Scoreboards)
2014-3-11 John Lazzaro (not a prof; "John" is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Spring 2014 © UCB
DEC Alpha 21164
Top performing microprocessor in its day (1995).
300 MFLOPS in 0.5 µm CMOS at 300 MHz.
DEC Alpha 21164
Lockup-free cache integration.
Uses techniques we cover in Part I of the lecture.
Use of many functional units.
Many instructions issued per cycle (superscalar)
DEC Alpha 21164
Most of chip is cache (in blue).
This 4-issue chip was the high watermark for in-order designs.
In 2014, in-order superscalar lives in the cost-sensitive sector ...
Marvell Embedded CPU: in-order dual-core superscalar
Chromecast: a web browser in a flash-drive form factor. Plugs into the HDMI port on a TV. Includes a Wi-Fi chip so you can control the browser from your cell phone.
Wi-Fi chip; ARM CPU (Marvell); 512 MB DRAM; 2 GB Flash.
$35 retail implies a Bill of Materials (BOM) in the $20 range ...
UC Regents Fall 2008 © UCB, CS 194-6 L9: Advanced Processors I
Key issue: overcoming data hazards.
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction I1, but I2 executes "too early" and reads the wrong copy of the data.
Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. Instead, I2 writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2 writes over data that an earlier instruction I1 also writes. Instead, I1 writes after I2, and the final data value is incorrect.
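The three hazard classes above can be checked mechanically from each instruction's register read and write sets. A minimal sketch (the `hazards` helper is hypothetical, not from the lecture):

```python
def hazards(i1_reads, i1_writes, i2_reads, i2_writes):
    """Classify data hazards between earlier instruction I1 and later I2.

    Each argument is a set of register names. Returns the subset of
    {"RAW", "WAR", "WAW"} that applies.
    """
    found = set()
    if i1_writes & i2_reads:   # I2 reads what I1 writes
        found.add("RAW")
    if i1_reads & i2_writes:   # I2 overwrites what I1 still needs to read
        found.add("WAR")
    if i1_writes & i2_writes:  # both write the same register
        found.add("WAW")
    return found

# DIV R1,R2,R3 followed by SUB R1,R2,R3: same destination register
print(hazards({"R2", "R3"}, {"R1"}, {"R2", "R3"}, {"R1"}))  # {'WAW'}
```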
Key issue: Structural Hazards ...
[Slide background: "Figure 1: Five Functional Units on the Alpha 21164 Microprocessor," Digital Technical Journal, Vol. 7, No. 1, 1995, p. 120. The block diagram spans pipeline stages S-1 through S9: the instruction fetch/decode unit (8-KB, 32-byte-block, direct-mapped I-cache; 48-entry associative instruction translation buffer; instruction buffer, slot logic, and issue scoreboard logic), the integer execution unit (integer pipe 0: ADD, LOG, SHIFT, LD, ST, IMUL, CMP, CMOV, BYTE, WORD; integer pipe 1: ADD, LOG, LD, BR, CMP, CMOV; integer multiplier), the floating-point execution unit (add pipe with divider, multiply pipe), the memory address translation unit (8-KB, 32-byte-block, direct-mapped, dual-read-ported D-cache; 64-entry associative dual-ported translation buffer; miss address file with 6 data misses and 4 instruction-stream misses; write buffer with six 32-byte entries), the 96-KB, 64-byte-block, 3-way set-associative second-level S-cache, and the cache control and bus interface unit connecting to an off-chip 1-MB to 64-MB direct-mapped backup B-cache.]
Floating-point pipeline of the Alpha 21164: insufficient register write ports to service all sources every clock cycle, and not every arithmetic unit is fully pipelined.
Topic #1: the CPU side of our hit-over-miss cache ...
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1.
We do a normal cache access. If there is a hit, we place the load result in Queue 2 ...
In the case of a miss, we use the Inverted Miss Status Holding Register.
("We" == the L1 D-cache controller.)
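The request/reply protocol can be sketched as a toy controller. The MTYPE/TAG/MADDR fields come from the slide; the class, its method names, and the miss-table representation are assumptions:

```python
from collections import deque

class ToyHitOverMissCache:
    """Toy L1 D-cache controller: hits reply immediately via Queue 2,
    misses park in a miss-status table until the fill arrives."""
    def __init__(self, cached):
        self.cached = dict(cached)   # addr -> data currently in the cache
        self.queue2 = deque()        # replies to the CPU: (tag, data)
        self.mshr = {}               # addr -> list of tags waiting on it

    def request(self, mtype, tag, maddr):
        assert mtype == "read"       # this sketch handles loads only
        if maddr in self.cached:     # hit: reply right away
            self.queue2.append((tag, self.cached[maddr]))
        else:                        # miss: remember who is waiting
            self.mshr.setdefault(maddr, []).append(tag)

    def fill(self, maddr, data):
        """Memory returns a missed line; wake every waiting request."""
        self.cached[maddr] = data
        for tag in self.mshr.pop(maddr, []):
            self.queue2.append((tag, data))

c = ToyHitOverMissCache({0x40: 7})
c.request("read", tag=1, maddr=0x80)  # miss: parks in the miss table
c.request("read", tag=2, maddr=0x40)  # hit: returns ahead of the miss
c.fill(0x80, 9)
print(list(c.queue2))                 # hit-over-miss: [(2, 7), (1, 9)]
```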
Integrating queues into the pipeline ...
[Slide background: "Fig. 2. Microprocessor pipeline organization," from an embedded ARM microprocessor paper, IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November 2001, p. 1600.]
A memory pipe splits off from the main pipeline after the ALU calculates the index.
[Diagram: Queue 1, Queue 2.]
The CPU uses 5 bits of the TAG to encode the target/source register for LW/SW.
LockBits: a scoreboard data structure
[Slide background: the embedded ARM pipeline figure again (IEEE JSSC, Vol. 36, No. 11, November 2001, p. 1600), overlaid with a LockBits memory: one bit per register, with a write port (WE, 5-bit select ws, 1-bit data wd) and a read port (5-bit select rs, 1-bit data rd).]
Each register has a lock bit, initialized to 0; this is an example of a scoreboard data structure.
In the decode stage, we stall any instruction that reads or writes a locked register.
In the decode stage, we lock the target register of any LW we issue.
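The lock-bit discipline can be sketched in a few lines. A toy model under the slide's rules (class and method names are my own):

```python
class LockBitScoreboard:
    """One lock bit per architectural register, initialized to 0
    (a sketch of the slide's LockBits structure)."""
    def __init__(self, num_regs=32):
        self.locked = [False] * num_regs

    def can_issue(self, reads, writes):
        """Decode-stage check: stall if any source or destination
        register is locked by an outstanding load."""
        return not any(self.locked[r] for r in reads | writes)

    def issue_load(self, rd):
        self.locked[rd] = True       # lock the LW target at issue

    def complete_load(self, rd):
        self.locked[rd] = False      # clear when the load data returns

sb = LockBitScoreboard()
sb.issue_load(5)                     # LW R5, 0(R2) issues
print(sb.can_issue({5, 2}, {7}))     # ADD R7, R5, R2 must stall: False
sb.complete_load(5)
print(sb.can_issue({5, 2}, {7}))     # after the fill, it may issue: True
```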
How lock bits are cleared ...
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
When data is returned to the CPU via Queue 2, the CPU writes the data into the register file and clears the associated lock bit.
Dedicated write ports are needed to avoid structural hazards.
Memory semantics and lock-free caches
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
The CPU expects that loads and stores to the same memory location are applied in queued order.
The simple (low-performance) approach is for the data cache to "snoop" Queue 1 and delay accepting writes to addresses that are being read. Finally, note the lack of sequential consistency.
Topic #2: pipelines and latency ...
This pipeline splits after the RF stage, feeding functional units with different latencies.
Split pipelines: a write-after-write hazard.
The pipeline splits after the RF stage, feeding functional units
with different latencies.
WAW Hazard:
DIV R1, R2, R3
SUB R1, R2, R3
If the long-latency DIV and the short-latency SUB are sent to parallel pipes, SUB may finish first.
Solution: SUB detects the R1 clash in the decode stage and stalls, via a pipe-write scoreboard.
Register write port: a structural hazard
Other solutions are possible ... above, the solution of separate write ports.
Structural Hazard:
DIV R1, R2, R3
[...]
SUB R5, R2, R3
DIV and SUB may need to write the register file at the same time.
Solution: a scoreboard structure to reserve future slots of the write port. Stall SUB in decode until a slot opens.
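One way to "reserve future slots of the write port" is a shift register indexed by cycles-from-now. This is a sketch of that idea, not the lecture's design; the 34-slot horizon and the 32-cycle DIV latency are arbitrary illustrative choices:

```python
class WritePortScoreboard:
    """Reserve future cycles of a single register-file write port.

    Slot i of the shift register means "the write port is taken
    i cycles from now."
    """
    def __init__(self, horizon=34):
        self.reserved = [False] * horizon

    def try_issue(self, latency):
        """In decode: reserve the writeback slot `latency` cycles out.
        Returns False (stall) if that slot is already taken."""
        if self.reserved[latency]:
            return False
        self.reserved[latency] = True
        return True

    def tick(self):
        """Advance one clock: every writeback gets one cycle closer."""
        self.reserved = self.reserved[1:] + [False]

sb = WritePortScoreboard()
print(sb.try_issue(32))   # DIV reserves its writeback slot 32 cycles out: True
for _ in range(31):
    sb.tick()             # 31 cycles later ...
print(sb.try_issue(1))    # SUB would write the same cycle as DIV: False (stall)
sb.tick()
print(sb.try_issue(1))    # one cycle later the slot is free: True
```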
Functional unit input: a structural hazard
The pipeline splits after the RF stage, feeding functional units
with different latencies.
Structural Hazard:
DIV R1, R2, R3
DIV R5, R2, R3
Divide is usually not fully pipelined, and cannot accept new operands every cycle.
Solution: a scoreboard structure to detect busy functional units. Stall DIV R5, ... in decode until the divider is ready.
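The busy-unit check reduces to a countdown per functional unit. A sketch (the 32-cycle divider occupancy is an assumed figure, and the class is my own):

```python
class BusyUnitScoreboard:
    """Track when each non-pipelined unit can accept new operands."""
    def __init__(self):
        self.busy_for = {"DIV": 0}   # cycles until the unit is free

    def try_issue(self, unit, occupancy):
        """In decode: issue only if the unit is free, then mark it
        busy for `occupancy` cycles."""
        if self.busy_for[unit] > 0:
            return False             # divider still chewing: stall
        self.busy_for[unit] = occupancy
        return True

    def tick(self):
        for u in self.busy_for:
            self.busy_for[u] = max(0, self.busy_for[u] - 1)

sb = BusyUnitScoreboard()
print(sb.try_issue("DIV", 32))   # DIV R1, R2, R3 issues: True
print(sb.try_issue("DIV", 32))   # DIV R5, R2, R3 must stall: False
for _ in range(32):
    sb.tick()
print(sb.try_issue("DIV", 32))   # divider free again: True
```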
Imprecise exceptions: A difficult issue
The pipeline splits after the RF stage, feeding functional units
with different latencies.
Exceptions
DIV R1, R2, R3
SUB R4, R2, R3

If DIV throws an exception after SUB writes back, the contract with the programmer breaks.
Solutions: Too complicated for a slide. See page C-58 in CA-AQA
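A toy timeline makes the problem concrete. The latencies below are made up for illustration; the point is only that with in-order issue but out-of-order completion, the short SUB retires long before the long DIV reaches its exception point.

```python
# Hypothetical functional-unit latencies (cycles); DIV is long, SUB is short.
latency = {"DIV": 12, "SUB": 1}

# In-order issue: DIV at cycle 0, SUB at cycle 1.
program = [("DIV", "R1"), ("SUB", "R4")]

# Completion (write-back) time = issue cycle + latency, sorted by finish time.
events = sorted((cycle + latency[op], op, dest)
                for cycle, (op, dest) in enumerate(program))

print(events)
# SUB writes R4 at cycle 2; if DIV faults at cycle 12, SUB's write-back
# has already happened, so the exception is imprecise.
```

Restoring the "all earlier instructions done, no later instruction has changed state" contract is exactly what reorder buffers and related mechanisms (CA-AQA, p. C-58) provide.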
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Improve CPI by issuing several instructions per cycle.
Difficulties: Load and branch delays affect more instructions. Ultimate limiter: programs may be a poor match to issue rules.
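Plugging illustrative numbers into the equation above shows why CPI is the target: at a fixed clock rate and instruction count, halving CPI halves runtime. The 300 MHz figure echoes the 21164; the instruction count is invented for the example.

```python
instructions = 1_000_000_000   # hypothetical dynamic instruction count
clock_hz = 300e6               # 300 MHz clock

def runtime(cpi):
    # Seconds/Program = Instructions × (Cycles/Instruction) × (Seconds/Cycle)
    return instructions * cpi / clock_hz

single_issue = runtime(1.0)    # CPI = 1: one instruction per cycle
dual_issue = runtime(0.5)      # ideal 2-wide superscalar

print(single_issue, dual_issue)
```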
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle.
Superscalar: Multiple issues per cycle
Recall VLIW: Super-sized Instructions
Example: All instructions are 64 bits. Each instruction consists of two 32-bit MIPS instructions that execute in parallel.
opcode rs rt rd shamt funct
opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10   Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9    Semantics: $7 = $8 + $9
A 64-bit VLIW instruction
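As a sketch, the two ADDs above can be bundled into one 64-bit word. The 32-bit field layout is the standard MIPS R-format; packing the first slot into the high half is an assumption for illustration.

```python
def r_format(rs, rt, rd, shamt=0, funct=0x20, opcode=0):
    """Encode a 32-bit MIPS R-format instruction (funct 0x20 = ADD)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

slot0 = r_format(rs=9, rt=10, rd=8)   # ADD $8, $9, $10
slot1 = r_format(rs=8, rt=9, rd=7)    # ADD $7, $8, $9

# One 64-bit VLIW word: slot0 in the high 32 bits, slot1 in the low 32 bits.
# Note both slots read the register file before either writes, so slot1's
# read of $8 sees the old value even though slot0 writes $8.
vliw_word = (slot0 << 32) | slot1
print(hex(vliw_word))
```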
But what if we can't change ISA execution semantics?
Superscalar R machine

[Figure: two parallel five-stage pipelines (IF, ID, EX, MEM, WB), each with its own 32-bit ALU. A shared register file provides four read ports (rs1/rs2 and rs3/rs4 feeding rd1-rd4) and two write ports (ws1/wd1/WE1 and ws2/wd2/WE2). A 64-bit instruction memory port feeds the PC and sequencer and the instruction issue logic, which dispatches one instruction to each pipe.]
Sustaining Dual Instr Issues (no forwarding)

[Figure: the same dual-pipeline datapath, annotated with ADD pairs flowing through both pipes.]

ADD R8,R0,R0     ADD R11,R0,R0
ADD R9,R8,R7     ADD R12,R11,R10
ADD R15,R14,R13  ADD R18,R17,R16
ADD R21,R20,R19  ADD R24,R23,R22
ADD R27,R26,R25  ADD R30,R29,R28
It’s rarely this good ...
[Figure: the same dual-pipeline datapath, annotated with a serialized instruction stream.]
Worst-Case Instruction Issue

ADD R8,R0,R0    NOP
ADD R9,R8,R0    NOP
ADD R10,R9,R0   NOP
ADD R11,R10,R0  NOP
Dependencies force “serialization”
We add 12 forwarding buses (not shown): 6 to each ID from the stages of both pipes.
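The bus count follows directly from the geometry. A quick check, assuming each pipe's decode stage snoops results from the EX, MEM, and WB stages of both pipes (the stage names are an assumption consistent with this datapath):

```python
pipes = 2            # two parallel pipelines
fwd_stages = 3       # stages whose results can be forwarded (EX, MEM, WB)
decode_stages = 2    # one ID stage per pipe consumes forwarded values

buses_per_id = pipes * fwd_stages          # buses into each ID
total_buses = buses_per_id * decode_stages
print(buses_per_id, total_buses)           # 6 12
```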
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Example: Superscalar MIPS. Fetches 2 instructions at a time. If the first is an integer instruction and the second is floating point, issue both in the same cycle.
Superscalar: A simple example ...
Integer instruction   FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)        ADDD F4,F0,F2
LD F14,-24(R1)        ADDD F8,F6,F2
LD F18,-32(R1)        ADDD F12,F10,F2
SD 0(R1),F4           ADDD F16,F14,F2
SD -8(R1),F8          ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16

Two issues per cycle where both columns are filled; one issue per cycle elsewhere.
Why is the control for this CPU not so hard to do?
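Control stays simple because the issue decision is a fixed pattern check plus one dependence test within the fetched pair. A minimal sketch (field names are illustrative, not from the slide):

```python
from dataclasses import dataclass

@dataclass
class Instr:
    kind: str      # "int" or "fp"
    dest: str
    srcs: tuple

def issues_this_cycle(slot0, slot1):
    """Lockstep rule: issue both only when slot0 is integer, slot1 is
    floating point, and slot1 does not read slot0's result."""
    if slot0.kind != "int" or slot1.kind != "fp":
        return 1                      # fall back to issuing slot0 alone
    if slot0.dest in slot1.srcs:      # RAW hazard inside the pair
        return 1
    return 2

ld   = Instr("int", "F10", ("R1",))          # LD F10,-16(R1): integer pipe
addd = Instr("fp",  "F4",  ("F0", "F2"))     # ADDD F4,F0,F2: FP pipe
print(issues_this_cycle(ld, addd))
```

No reordering, no renaming, no issue queue: just one comparator tree on the two fetched instructions.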
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Three instructions are potentially affected by a single cycle of load delay, as FP register loads are done in the “integer” pipeline.
Superscalar: Visualizing the pipeline
Type               Pipe stages
Int. instruction   IF ID EX MEM WB
FP instruction     IF ID EX MEM WB
Int. instruction      IF ID EX MEM WB
FP instruction        IF ID EX MEM WB
Int. instruction         IF ID EX MEM WB
FP instruction           IF ID EX MEM WB
Limitations of “lockstep” superscalar

Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, may be OK.
Extending scheme to speed up general apps (Microsoft Office, ...) is complicated.
If one accepts building a complicated machine, there are better ways to do it.
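The 0.5-CPI claim is easy to check with a back-of-the-envelope model. Assume perfect scheduling and no hazards; the pairing rule is one integer plus one FP instruction per cycle, with leftovers of the majority type issuing one per cycle.

```python
def best_case_cpi(fp_fraction):
    """Best-case CPI for lockstep int+fp dual issue.
    A cycle retires a pair when both an int and an fp instruction
    are available; surplus instructions issue one per cycle."""
    pairs = min(fp_fraction, 1.0 - fp_fraction)   # dual-issue cycles per instruction
    singles = 1.0 - 2.0 * pairs                   # single-issue leftovers
    return pairs + singles                        # cycles per instruction

print(best_case_cpi(0.5))    # 50/50 mix: the ideal case
print(best_case_cpi(0.25))   # 25% FP: much of the int work issues alone
print(best_case_cpi(0.0))    # pure integer code: no speedup at all
```

Any deviation from the 50/50 mix, before counting hazards, pushes CPI back toward 1.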
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
MARCH–APRIL 2004
[Figure 3 shows four pipelines (branch, load/store, fixed-point, floating-point) fanning out after group formation and instruction decode; out-of-order processing spans MP through WB, and branch redirects and interrupt/flush paths return to instruction fetch.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure 4 distinguishes resources shared by the two threads (instruction cache and translation, branch prediction with branch history tables, return stack, and target cache, shared register mappers and files, shared issue queues, and shared execution units: FXU0/FXU1, LSU0/LSU1, FPU0/FPU1, BXU, CRL, plus the store queue, data cache/translation, and L2 cache) from per-thread resources such as the program counter and instruction buffers 0 and 1, with thread-priority logic steering dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Dynamic Scheduling: After spring break.
Digital Technical Journal, Vol. 7, No. 1, 1995
Figure 1. Five Functional Units on the Alpha 21164 Microprocessor.

[Block diagram: an instruction fetch/decode unit with an 8-KB, 32-byte-block, direct-mapped instruction cache, a 48-entry associative instruction translation buffer, an instruction stream refill buffer, instruction buffer and slotting logic, and issue scoreboard logic; an integer execution unit with two pipes (pipe 0: ADD, LOG, SHIFT, LD, ST, IMUL, CMP, CMOV, BYTE, WORD; pipe 1: ADD, LOG, LD, BR, CMP, CMOV) plus an integer multiplier; a floating-point execution unit with an add pipe plus divider and a multiply pipe; a memory address translation unit with an 8-KB, 32-byte-block, direct-mapped, dual-read-ported data cache, a 64-entry associative dual-ported translation buffer, a miss address file (6 data misses, 4 instruction stream misses), and a write buffer of six 32-byte entries; a 96-KB, 64-byte-block, 3-way set-associative second-level cache (S-cache); and a cache control and bus interface unit driving a 1-MB to 64-MB direct-mapped off-chip backup cache (B-cache). Pipeline stages run S-1 through S9.]
DEC Alpha 21164
This 4-issue chip was the high watermark for in-order superscalar designs.
RISC versus CISC: A Tale of Two Chips
Dileep Bhandarkar
Intel Corporation
Santa Clara, California, USA

Abstract
This paper compares an aggressive RISC and CISC implementation built with comparable technology. The two chips are the Alpha* 21164 and the Intel Pentium® Pro processor. The paper presents performance comparisons for industry standard benchmarks and uses performance counter statistics to compare various aspects of both designs.

Introduction
In 1991, Bhandarkar and Clark published a paper comparing an example implementation from the RISC and CISC architectural schools (a MIPS* M/2000 and a Digital VAX* 8700) on nine of the ten SPEC89 benchmarks. The organizational similarity of these machines provided an opportunity to examine the purely architectural advantages of RISC. That paper showed that the resulting advantage in cycles per program ranged from slightly under a factor of 2 to almost a factor of 4, with a geometric mean of 2.7. This paper attempts yet another comparison of a leading RISC and CISC implementation, but using chips designed with comparable semiconductor technology. The RISC chip chosen for this study is the Digital Alpha 21164 [Edmondson95]. The CISC chip is the Intel Pentium® Pro processor [Colwell95]. The results should not be used to draw sweeping conclusions about RISC and CISC in general. They should be viewed as a snapshot in time. Note that performance is also determined by the system platform and compiler used.

Chip Overview
Table 1 shows the major characteristics of the two chips. Both chips are implemented in around 0.5µ technology and the die size is comparable. The design approach is quite different, but both represent state of the art implementations that achieved the highest performance for RISC and CISC architectures respectively at the time of their introduction.

Table 1. Chip Comparison

                       Alpha 21164              Pentium® Pro Processor
Architecture           Alpha                    IA-32
Clock Speed            300 MHz                  150 MHz
Issue Rate             Four                     Three
Function Units         four                     five
Out-of-order issue     no                       yes
Rename Registers       none                     40
On-chip Cache          8 KB data, 8 KB instr,   8 KB data, 8 KB instr
                       96 KB Level 2
Off-chip cache         4 MB                     256 KB
Branch History Table   2048 entries,            512 entries,
                       2-bit history            4-bit history
Transistors            1.8 million logic,       4.5 million logic,
                       9.3 million total        5.5 million total
VLSI Process           CMOS, 0.5 µ,             BiCMOS, 0.6 µ,
                       4 metal layers           4 metal layers
Die Size               298 mm2                  306 mm2
Package                499-pin PGA              387-pin PGA
Power                  50 W                     20 W incl. cache
First Silicon          Feb. 94                  4Q 94
Volume Parts           1Q 95                    4Q 95
SPECint92/95           341/7.43                 245/6.08
SPECfp92/95            513/12.4                 220/5.42
SYSmark/NT             529                      497

The 21164 is a quad-issue superscalar design that implements two levels of caches on chip, but does not implement out-of-order execution. The Pentium® Pro processor implements dynamic execution using an out-of-order, speculative execution engine, with register renaming of integer, floating point and flags variables. Consequently, even though the die size is comparable, the total transistor count is quite different for the two chips. The aggressive design of the Pentium Pro processor is much more logic intensive; and logic transistors are less dense. The on-chip 96 KB L2 cache of the 21164 inflates its transistor count. Even though the Alpha 21164 has an on-chip L2 cache, most systems use a 2 or 4 MB board level cache to achieve their performance goal.
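A quick normalization of the Table 1 numbers (arithmetic on the published figures only; the per-MHz framing is ours, not the paper's) shows how the out-of-order design compensates for its clock-rate deficit:

```python
# SPEC95 ratios and clock rates taken from Table 1 above.
alpha = {"clock_mhz": 300, "specint95": 7.43, "specfp95": 12.4}
ppro  = {"clock_mhz": 150, "specint95": 6.08, "specfp95": 5.42}

def per_mhz(chip, key):
    # performance delivered per MHz of clock
    return chip[key] / chip["clock_mhz"]

# Pentium Pro's per-clock advantage (>1 means more work per cycle).
int_ratio = per_mhz(ppro, "specint95") / per_mhz(alpha, "specint95")
fp_ratio  = per_mhz(ppro, "specfp95") / per_mhz(alpha, "specfp95")

print(round(int_ratio, 2), round(fp_ratio, 2))
```

Per cycle, the out-of-order CISC chip does substantially more integer work, while the in-order Alpha keeps a per-clock edge on floating point.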
Alpha 21164 (12.1 SPECint95) and 200 MHz Pentium® Pro (8.71 SPECint95) processors, circa October 1996. The results show that while the Alpha system is 45% faster on SPECint_rate95, it is 8% slower on the TPC-C benchmark with a 59% higher $/tpmC using the same database software!
Table 4. TPC-C Performance

                  Compaq ProLiant 5000       Digital AlphaServer
                  Model 6/200                4100 5/400
CPUs              Four 200 MHz Pentium       Four 400 MHz Alpha
                  Pro processors             21164 processors
L2 cache          512 KB                     4 MB
SPECint_rate95    292 (est)2                 422
TPC-C perf        8311 tpmC @                7598 tpmC @
                  $95.32/tpmC                $152.04/tpmC
Operating Sys     SCO UnixWare               Digital UNIX
Database          Sybase SQL Server 11.0     Sybase SQL Server 11.0
Concluding Remarks
Studies like this one offer some insight into the performance characteristics of different instruction set architectures and attempts to implement them well in a comparable technology. The overall performance is affected by many factors and strict cause-effect relationships are hard to pinpoint. Such explorations are also hindered by the lack of measured data on common workloads for systems designed by different companies. This study would have been more meaningful if more stressful environments like on-line transaction processing and computer aided design could have been analyzed in detail. Nevertheless, it does provide new quantitative data that can be used to get a better understanding of the performance differences between a premier RISC and CISC implementation.

Using a comparable die size, the Pentium® Pro processor achieves 80 to 90% of the performance of the Alpha 21164 on integer benchmarks and transaction processing workloads. It uses an aggressive out-of-order design to overcome the instruction set level limitations of a CISC architecture. On floating-point intensive benchmarks, the Alpha 21164 does achieve over twice the performance of the Pentium Pro processor.
2 Measured result for Fujitsu ICL Superserver J654i using the same processor.
Acknowledgments
The author is grateful to Jeff Reilly and Mason Guy of Intel for collecting the performance counter measurement data for the Pentium® Pro processor, and Zarka Cvetanovic of Digital Equipment Corporation for providing the performance counter measurement data for the Alpha 21164.
References
[Bannon95] P. Bannon and J. Keller, “Internal Architecture of Alpha 21164 Microprocessor,” Proc. Compcon Spring 95, March 1995.
[Bhandarkar91] D. Bhandarkar and D. Clark, “Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization,” Proceedings of ASPLOS-IV, April 1991.
[Bhandarkar95] D. Bhandarkar, “Alpha Implementations and Architecture: Complete Reference and Guide,” Digital Press, Newton, MA, 1995, ISBN 1-55558-130-7.
[Bhandarkar97] D. Bhandarkar and J. Ding, “Performance Characterization of the Pentium Pro Processor,” Proceedings of HPCA-3, February 1997.
[Colwell95] R. Colwell and R. Steck, “A 0.6um BiCMOS Processor with Dynamic Execution,” ISSCC Proceedings, pp. 176-177, February 1995.
[Cvetanovic96] Z. Cvetanovic and D. Bhandarkar, “Performance Characterization of the Alpha 21164 Microprocessor using TP and SPEC Workloads,” Proceedings of HPCA-2, February 1996.
[Edmondson95] J. Edmondson et al., “Superscalar Instruction Execution in the 21164 Microprocessor,” IEEE Micro, April 1995, pp. 33-43.
[Papworth96] D. Papworth, “Tuning The Pentium® Pro Microarchitecture,” IEEE Micro, April 1996, pp. 8-15.
[Yeh91] Tse-Yu Yeh and Yale Patt, “Two-Level Adaptive Training Branch Prediction,” Proc. IEEE Micro-24, Nov. 1991, pp. 51-61.

* Intel® and Pentium® are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Final paragraph
DEC was sold off to Compaq a few years later ... who sold off Digital Semiconductor to Intel ... who still makes Alpha chips in small batches for HP (who bought Compaq).
Break
The CDC 6600 was the world’s fastest computer for 5 years (1964-1969).
The design team was located in a small town in Wisconsin, the home town of its leader, Seymour Cray.
The lab was placed far from CDC headquarters in Minneapolis, to limit
interference from upper management.
Operator Console
Mainframe
Top view: a “+” sign
Tape Drives
Punched card reader
Top-down view: Entire main frame was liquid-cooled with Freon.
Transistor-based design, running at 100 ns clock speed.
64K of 60-bit words, implemented with magnetic core memory.
Bus wires: twisted wire pairs that were trimmed by hand to meet cycle time.
Museum collection ...
First commercial use of display consoles ... ran “space wars” vector games.
Freon cooling control panel.
Twisted pair bus wires.
Trimmed by hand.
Magnetic core memory module
Memory modules were hand-woven by former textile workers ... this is why the machine cost $7M in 1962 dollars!
Logic gate circuit modules ...
50 transistors: 2.5 x 2.5 x 0.8 inch
Peripheral processor invented multithreading
Out-of-order execution.
“Scoreboard”
10 functional units
Long, variable latency
Register File
The first RISC machine
Includes eight 60-bit floating point registers
Architecture
Instruction Fetch and the Scoreboard
The scoreboard controls the execution flow of all instructions. Its goal is to maintain a CPI of 1.
The instruction fetch unit is decoupled. Its goal is to pass one decoded instruction to the scoreboard every cycle. The scoreboard holds decoded copies of all in-flight instructions, and tracks the status of all elements cycle-by-cycle.
Lifecycle of an instruction in the scoreboard (part 1)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Newly arrived instructions are placed in this state until (1) a functional unit becomes free, and (2) no other issued instruction wants to write the register it wants to write.
If an instruction is in pending issue, the scoreboard stalls the instruction fetch unit.
Prevents WAW hazards.
Lifecycle of an instruction in the scoreboard (part 2)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Instructions remain in this state until neither of their operand registers is waiting to be written by a functional unit.
Prevents RAW hazards.
Lifecycle of an instruction in the scoreboard (part 3)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

This state can last many cycles, as functional units have long latency.
Lifecycle of an instruction in the scoreboard (part 4)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Instructions may pass through this state, unless there is an instruction in Pending or Awaiting mode that (1) preceded it in the instruction stream, and (2) needs to read the register this instruction plans to write.

Prevents WAR hazards.
What the scoreboard keeps score of

The full status of each functional unit:
(1) Is it running an instruction? Which one?
(2) What are its source/destination registers?
(3) For each source: waiting / ready-to-read / read.
(4) For each source: who will be writing it?

For each register, which functional unit is planning to write it?

Current state of all in-flight instructions.
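The bookkeeping above maps onto two small tables: per-unit status and a register-to-writer map. This is a minimal Python sketch of CDC-6600-style scoreboard checks (unit names, register numbering, and the exact field set are illustrative; it follows the classic busy/dest/src/pending-producer/ready-to-read bookkeeping, not the actual CDC hardware):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FU:
    busy: bool = False
    dest: Optional[int] = None
    src1: Optional[int] = None
    src2: Optional[int] = None
    q1: Optional[str] = None   # unit that will produce src1 (None = no pending writer)
    q2: Optional[str] = None
    r1: bool = False           # src1 value is ready but not yet read
    r2: bool = False

class Scoreboard:
    def __init__(self, units):
        self.fu = {u: FU() for u in units}
        self.writer = {}       # register -> name of unit that will write it

    def can_issue(self, u, dest):
        # stall fetch on a structural hazard (unit busy) or WAW hazard
        return not self.fu[u].busy and dest not in self.writer

    def issue(self, u, dest, src1, src2):
        assert self.can_issue(u, dest)
        f = self.fu[u]
        f.busy, f.dest, f.src1, f.src2 = True, dest, src1, src2
        f.q1, f.q2 = self.writer.get(src1), self.writer.get(src2)
        f.r1, f.r2 = f.q1 is None, f.q2 is None
        self.writer[dest] = u

    def can_read_operands(self, u):
        # RAW: wait until neither operand has a pending producer
        return self.fu[u].r1 and self.fu[u].r2

    def read_operands(self, u):
        self.fu[u].r1 = self.fu[u].r2 = False

    def can_write_result(self, u):
        # WAR: block while any other unit still holds the old value of our
        # destination register as an unread source (ready but not yet read)
        d = self.fu[u].dest
        return all(not (f.busy and ((f.src1 == d and f.r1) or
                                    (f.src2 == d and f.r2)))
                   for name, f in self.fu.items() if name != u)

    def write_result(self, u):
        f = self.fu[u]
        for g in self.fu.values():        # wake units waiting on this result
            if g.q1 == u:
                g.q1, g.r1 = None, True
            if g.q2 == u:
                g.q2, g.r2 = None, True
        del self.writer[f.dest]
        self.fu[u] = FU()

# RAW demo: an ADD must wait for a long-latency DIV's result.
sb = Scoreboard(["int0", "fdiv"])
sb.issue("fdiv", dest=0, src1=2, src2=4)   # DIV  F0 <- F2, F4
sb.read_operands("fdiv")
sb.issue("int0", dest=6, src1=0, src2=8)   # ADD  F6 <- F0, F8
assert not sb.can_read_operands("int0")    # stalls: F0 still pending
```

Issue stalls on structural and WAW hazards, operand read stalls on RAW hazards, and write-back stalls on WAR hazards, matching the four lifecycle slides above.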
Limitations of scoreboard control ...
If one accepts building a complicated machine, there are better ways to do it.
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
[Figure 3 shows four pipelines (branch, load/store, fixed-point, floating-point) fanning out after group formation and instruction decode; out-of-order processing spans MP through WB, and branch redirects and interrupt/flush paths return to instruction fetch.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure 4 distinguishes resources shared by the two threads (instruction cache and translation, branch prediction with branch history tables, return stack, and target cache, shared register mappers and files, shared issue queues, and shared execution units: FXU0/FXU1, LSU0/LSU1, FPU0/FPU1, BXU, CRL, plus the store queue, data cache/translation, and L2 cache) from per-thread resources such as the program counter and instruction buffers 0 and 1, with thread-priority logic steering dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Dynamic Scheduling: After spring break.
On Thursday
Midterm Review Lecture