CS 152: Computer Architecture and Engineering
Lecture 15: Advanced CPUs (Superscalars and Scoreboards)
2014-3-11 John Lazzaro (not a prof; "John" is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Spring 2014 © UCB
DEC Alpha 21164
Top performing microprocessor in its day (1995).
300 MFLOPS in 0.5 µm CMOS at 300 MHz.
DEC Alpha 21164
Lockup-free cache integration.
Uses techniques we cover in Part I of the lecture.
Use of many functional units.
Many instructions issued per cycle (superscalar)
DEC Alpha 21164
Most of chip is cache (in blue).
This 4-issue chip was the high watermark for in-order designs.
In 2014, in-order superscalar lives in the cost-sensitive sector ...
Marvell Embedded CPU: in-order dual-core superscalar
Chromecast: a web browser in a flash-drive form factor. Plugs into the HDMI port on a TV. Includes a Wi-Fi chip so you can control the browser from your cell phone.
Wi-Fi chip; ARM CPU (Marvell); 512 MB DRAM; 2 GB Flash.
$35 retail implies a Bill of Materials (BOM) in the $20 range ...
UC Regents Fall 2008 © UCB, CS 194-6 L9: Advanced Processors I
Key issue: overcoming data hazards.
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction I1, but I2 executes "too early" and reads the wrong copy of the data.
Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. Instead, I2 writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2 writes over data that an earlier instruction I1 also writes. Instead, I1 writes after I2, and the final data value is incorrect.
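The three hazard classes above can be checked mechanically from each instruction's register read and write sets. A minimal sketch (the `hazards` helper is hypothetical, not from the lecture):

```python
def hazards(i1_reads, i1_writes, i2_reads, i2_writes):
    """Classify data hazards between earlier instruction I1 and later I2.

    Each argument is a set of register names. Returns the subset of
    {"RAW", "WAR", "WAW"} that applies.
    """
    found = set()
    if i1_writes & i2_reads:   # I2 reads what I1 writes
        found.add("RAW")
    if i1_reads & i2_writes:   # I2 overwrites what I1 still needs to read
        found.add("WAR")
    if i1_writes & i2_writes:  # both write the same register
        found.add("WAW")
    return found

# DIV R1,R2,R3 followed by SUB R1,R2,R3: same destination register
print(hazards({"R2", "R3"}, {"R1"}, {"R2", "R3"}, {"R1"}))  # {'WAW'}
```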
Key issue: Structural Hazards ...
[Slide background: "Figure 1: Five Functional Units on the Alpha 21164 Microprocessor," Digital Technical Journal, Vol. 7, No. 1, 1995, p. 120. The block diagram spans pipeline stages S-1 through S9: the instruction fetch/decode unit (8-KB, 32-byte-block, direct-mapped I-cache; 48-entry associative instruction translation buffer; instruction buffer, slot logic, and issue scoreboard logic), the integer execution unit (integer pipe 0: ADD, LOG, SHIFT, LD, ST, IMUL, CMP, CMOV, BYTE, WORD; integer pipe 1: ADD, LOG, LD, BR, CMP, CMOV; integer multiplier), the floating-point execution unit (add pipe with divider, multiply pipe), the memory address translation unit (8-KB, 32-byte-block, direct-mapped, dual-read-ported D-cache; 64-entry associative dual-ported translation buffer; miss address file with 6 data misses and 4 instruction-stream misses; write buffer with six 32-byte entries), the 96-KB, 64-byte-block, 3-way set-associative second-level S-cache, and the cache control and bus interface unit connecting to an off-chip 1-MB to 64-MB direct-mapped backup B-cache.]
Floating-point pipeline of the Alpha 21164: insufficient register write ports to service all sources every clock cycle, and not every arithmetic unit is fully pipelined.
Topic #1: the CPU side of our hit-over-miss cache ...
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1.
We do a normal cache access. If there is a hit, we place the load result in Queue 2 ...
In the case of a miss, we use the Inverted Miss Status Holding Register.
("We" == the L1 D-cache controller.)
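The request/reply protocol can be sketched as a toy controller. The MTYPE/TAG/MADDR fields come from the slide; the class, its method names, and the miss-table representation are assumptions:

```python
from collections import deque

class ToyHitOverMissCache:
    """Toy L1 D-cache controller: hits reply immediately via Queue 2,
    misses park in a miss-status table until the fill arrives."""
    def __init__(self, cached):
        self.cached = dict(cached)   # addr -> data currently in the cache
        self.queue2 = deque()        # replies to the CPU: (tag, data)
        self.mshr = {}               # addr -> list of tags waiting on it

    def request(self, mtype, tag, maddr):
        assert mtype == "read"       # this sketch handles loads only
        if maddr in self.cached:     # hit: reply right away
            self.queue2.append((tag, self.cached[maddr]))
        else:                        # miss: remember who is waiting
            self.mshr.setdefault(maddr, []).append(tag)

    def fill(self, maddr, data):
        """Memory returns a missed line; wake every waiting request."""
        self.cached[maddr] = data
        for tag in self.mshr.pop(maddr, []):
            self.queue2.append((tag, data))

c = ToyHitOverMissCache({0x40: 7})
c.request("read", tag=1, maddr=0x80)  # miss: parks in the miss table
c.request("read", tag=2, maddr=0x40)  # hit: returns ahead of the miss
c.fill(0x80, 9)
print(list(c.queue2))                 # hit-over-miss: [(2, 7), (1, 9)]
```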
Integrating queues into the pipeline ...
[Slide background: "Fig. 2. Microprocessor pipeline organization," from an embedded ARM microprocessor paper, IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November 2001, p. 1600.]
A memory pipe splits off from the main pipeline after the ALU calculates the index.
[Diagram: Queue 1, Queue 2.]
The CPU uses 5 bits of the TAG to encode the target/source register for LW/SW.
LockBits: a scoreboard data structure
[Slide background: the embedded ARM pipeline figure again (IEEE JSSC, Vol. 36, No. 11, November 2001, p. 1600), overlaid with a LockBits memory: one bit per register, with a write port (WE, 5-bit select ws, 1-bit data wd) and a read port (5-bit select rs, 1-bit data rd).]
Each register has a lock bit, initialized to 0; this is an example of a scoreboard data structure.
In the decode stage, we stall any instruction that reads or writes a locked register.
In the decode stage, we lock the target register of any LW we issue.
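The lock-bit discipline can be sketched in a few lines. A toy model under the slide's rules (class and method names are my own):

```python
class LockBitScoreboard:
    """One lock bit per architectural register, initialized to 0
    (a sketch of the slide's LockBits structure)."""
    def __init__(self, num_regs=32):
        self.locked = [False] * num_regs

    def can_issue(self, reads, writes):
        """Decode-stage check: stall if any source or destination
        register is locked by an outstanding load."""
        return not any(self.locked[r] for r in reads | writes)

    def issue_load(self, rd):
        self.locked[rd] = True       # lock the LW target at issue

    def complete_load(self, rd):
        self.locked[rd] = False      # clear when the load data returns

sb = LockBitScoreboard()
sb.issue_load(5)                     # LW R5, 0(R2) issues
print(sb.can_issue({5, 2}, {7}))     # ADD R7, R5, R2 must stall: False
sb.complete_load(5)
print(sb.can_issue({5, 2}, {7}))     # after the fill, it may issue: True
```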
How lock bits are cleared ...
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
When data is returned to the CPU via Queue 2, the CPU writes the data into the register file and clears the associated lock bit.
Dedicated write ports are needed to avoid structural hazards.
Memory semantics and lock-free caches
[Diagram: Queue 1 carries requests from the CPU; Queue 2 carries replies to the CPU.]
The CPU expects that loads and stores to the same memory location are applied in queued order.
The simple (low-performance) approach is for the data cache to "snoop" Queue 1 and delay accepting writes to addresses that are being read. Finally, note the lack of sequential consistency.
Topic #2: pipelines and latency ...
This pipeline splits after the RF stage, feeding functional units with different latencies.
Split pipelines: a write-after-write hazard.
The pipeline splits after the RF stage, feeding functional units
with different latencies.
WAW Hazard:
DIV R1, R2, R3
SUB R1, R2, R3
If the long-latency DIV and the short-latency SUB are sent to parallel pipes, SUB may finish first.
Solution: SUB detects the R1 clash in the decode stage and stalls, via a pipe-write scoreboard.
Register write port: a structural hazard
Other solutions are possible ... above, the solution of separate write ports.
Structural Hazard:
DIV R1, R2, R3
[...]
SUB R5, R2, R3
DIV and SUB may need to write the register file at the same time.
Solution: a scoreboard structure to reserve future slots of the write port. Stall SUB in decode until a slot opens.
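One way to "reserve future slots of the write port" is a shift register indexed by cycles-from-now. This is a sketch of that idea, not the lecture's design; the 34-slot horizon and the 32-cycle DIV latency are arbitrary illustrative choices:

```python
class WritePortScoreboard:
    """Reserve future cycles of a single register-file write port.

    Slot i of the shift register means "the write port is taken
    i cycles from now."
    """
    def __init__(self, horizon=34):
        self.reserved = [False] * horizon

    def try_issue(self, latency):
        """In decode: reserve the writeback slot `latency` cycles out.
        Returns False (stall) if that slot is already taken."""
        if self.reserved[latency]:
            return False
        self.reserved[latency] = True
        return True

    def tick(self):
        """Advance one clock: every writeback gets one cycle closer."""
        self.reserved = self.reserved[1:] + [False]

sb = WritePortScoreboard()
print(sb.try_issue(32))   # DIV reserves its writeback slot 32 cycles out: True
for _ in range(31):
    sb.tick()             # 31 cycles later ...
print(sb.try_issue(1))    # SUB would write the same cycle as DIV: False (stall)
sb.tick()
print(sb.try_issue(1))    # one cycle later the slot is free: True
```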
Functional unit input: a structural hazard
The pipeline splits after the RF stage, feeding functional units
with different latencies.
Structural Hazard:
DIV R1, R2, R3
DIV R5, R2, R3
Divide is usually not fully pipelined, and cannot accept new operands every cycle.
Solution: a scoreboard structure to detect busy functional units. Stall DIV R5, ... in decode until the divider is ready.
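The busy-unit check reduces to a countdown per functional unit. A sketch (the 32-cycle divider occupancy is an assumed figure, and the class is my own):

```python
class BusyUnitScoreboard:
    """Track when each non-pipelined unit can accept new operands."""
    def __init__(self):
        self.busy_for = {"DIV": 0}   # cycles until the unit is free

    def try_issue(self, unit, occupancy):
        """In decode: issue only if the unit is free, then mark it
        busy for `occupancy` cycles."""
        if self.busy_for[unit] > 0:
            return False             # divider still chewing: stall
        self.busy_for[unit] = occupancy
        return True

    def tick(self):
        for u in self.busy_for:
            self.busy_for[u] = max(0, self.busy_for[u] - 1)

sb = BusyUnitScoreboard()
print(sb.try_issue("DIV", 32))   # DIV R1, R2, R3 issues: True
print(sb.try_issue("DIV", 32))   # DIV R5, R2, R3 must stall: False
for _ in range(32):
    sb.tick()
print(sb.try_issue("DIV", 32))   # divider free again: True
```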
Imprecise exceptions: A difficult issue
The pipeline splits after the RF stage, feeding functional units
with different latencies.
Exceptions
DIV R1, R2, R3
SUB R4, R2, R3

If DIV throws an exception after SUB writes back, the contract with the programmer breaks.
Solutions: Too complicated for a slide. See page C-58 in CA-AQA
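A toy timeline makes the problem concrete. The latencies below are made up for illustration; the point is only that with in-order issue but out-of-order completion, the short SUB retires long before the long DIV reaches its exception point.

```python
# Hypothetical functional-unit latencies (cycles); DIV is long, SUB is short.
latency = {"DIV": 12, "SUB": 1}

# In-order issue: DIV at cycle 0, SUB at cycle 1.
program = [("DIV", "R1"), ("SUB", "R4")]

# Completion (write-back) time = issue cycle + latency, sorted by finish time.
events = sorted((cycle + latency[op], op, dest)
                for cycle, (op, dest) in enumerate(program))

print(events)
# SUB writes R4 at cycle 2; if DIV faults at cycle 12, SUB's write-back
# has already happened, so the exception is imprecise.
```

Restoring the "all earlier instructions done, no later instruction has changed state" contract is exactly what reorder buffers and related mechanisms (CA-AQA, p. C-58) provide.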
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Improve CPI by issuing several instructions per cycle.
Difficulties: Load and branch delays affect more instructions. Ultimate limiter: programs may be a poor match to issue rules.
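Plugging illustrative numbers into the equation above shows why CPI is the target: at a fixed clock rate and instruction count, halving CPI halves runtime. The 300 MHz figure echoes the 21164; the instruction count is invented for the example.

```python
instructions = 1_000_000_000   # hypothetical dynamic instruction count
clock_hz = 300e6               # 300 MHz clock

def runtime(cpi):
    # Seconds/Program = Instructions × (Cycles/Instruction) × (Seconds/Cycle)
    return instructions * cpi / clock_hz

single_issue = runtime(1.0)    # CPI = 1: one instruction per cycle
dual_issue = runtime(0.5)      # ideal 2-wide superscalar

print(single_issue, dual_issue)
```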
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle.
Superscalar: Multiple issues per cycle
Recall VLIW: Super-sized Instructions
Example: All instructions are 64 bits. Each instruction consists of two 32-bit MIPS instructions that execute in parallel.
opcode rs rt rd shamt funct
opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10   Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9    Semantics: $7 = $8 + $9
A 64-bit VLIW instruction
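As a sketch, the two ADDs above can be bundled into one 64-bit word. The 32-bit field layout is the standard MIPS R-format; packing the first slot into the high half is an assumption for illustration.

```python
def r_format(rs, rt, rd, shamt=0, funct=0x20, opcode=0):
    """Encode a 32-bit MIPS R-format instruction (funct 0x20 = ADD)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

slot0 = r_format(rs=9, rt=10, rd=8)   # ADD $8, $9, $10
slot1 = r_format(rs=8, rt=9, rd=7)    # ADD $7, $8, $9

# One 64-bit VLIW word: slot0 in the high 32 bits, slot1 in the low 32 bits.
# Note both slots read the register file before either writes, so slot1's
# read of $8 sees the old value even though slot0 writes $8.
vliw_word = (slot0 << 32) | slot1
print(hex(vliw_word))
```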
But what if we can't change ISA execution semantics?
Superscalar R machine

[Figure: two parallel five-stage pipelines (IF, ID, EX, MEM, WB), each with its own 32-bit ALU. A shared register file provides four read ports (rs1/rs2 and rs3/rs4 feeding rd1-rd4) and two write ports (ws1/wd1/WE1 and ws2/wd2/WE2). A 64-bit instruction memory port feeds the PC and sequencer and the instruction issue logic, which dispatches one instruction to each pipe.]
Sustaining Dual Instr Issues (no forwarding)

[Figure: the same dual-pipeline datapath, annotated with ADD pairs flowing through both pipes.]

ADD R8,R0,R0     ADD R11,R0,R0
ADD R9,R8,R7     ADD R12,R11,R10
ADD R15,R14,R13  ADD R18,R17,R16
ADD R21,R20,R19  ADD R24,R23,R22
ADD R27,R26,R25  ADD R30,R29,R28
It’s rarely this good ...
[Figure: the same dual-pipeline datapath, annotated with a serialized instruction stream.]
Worst-Case Instruction Issue

ADD R8,R0,R0    NOP
ADD R9,R8,R0    NOP
ADD R10,R9,R0   NOP
ADD R11,R10,R0  NOP
Dependencies force “serialization”
We add 12 forwarding buses (not shown): 6 to each ID from the stages of both pipes.
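The bus count follows directly from the geometry. A quick check, assuming each pipe's decode stage snoops results from the EX, MEM, and WB stages of both pipes (the stage names are an assumption consistent with this datapath):

```python
pipes = 2            # two parallel pipelines
fwd_stages = 3       # stages whose results can be forwarded (EX, MEM, WB)
decode_stages = 2    # one ID stage per pipe consumes forwarded values

buses_per_id = pipes * fwd_stages          # buses into each ID
total_buses = buses_per_id * decode_stages
print(buses_per_id, total_buses)           # 6 12
```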
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Example: Superscalar MIPS. Fetches 2 instructions at a time. If the first is an integer instruction and the second is floating point, issue both in the same cycle.
Superscalar: A simple example ...
Integer instruction   FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)        ADDD F4,F0,F2
LD F14,-24(R1)        ADDD F8,F6,F2
LD F18,-32(R1)        ADDD F12,F10,F2
SD 0(R1),F4           ADDD F16,F14,F2
SD -8(R1),F8          ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16

Two issues per cycle where both columns are filled; one issue per cycle elsewhere.
Why is the control for this CPU not so hard to do?
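Control stays simple because the issue decision is a fixed pattern check plus one dependence test within the fetched pair. A minimal sketch (field names are illustrative, not from the slide):

```python
from dataclasses import dataclass

@dataclass
class Instr:
    kind: str      # "int" or "fp"
    dest: str
    srcs: tuple

def issues_this_cycle(slot0, slot1):
    """Lockstep rule: issue both only when slot0 is integer, slot1 is
    floating point, and slot1 does not read slot0's result."""
    if slot0.kind != "int" or slot1.kind != "fp":
        return 1                      # fall back to issuing slot0 alone
    if slot0.dest in slot1.srcs:      # RAW hazard inside the pair
        return 1
    return 2

ld   = Instr("int", "F10", ("R1",))          # LD F10,-16(R1): integer pipe
addd = Instr("fp",  "F4",  ("F0", "F2"))     # ADDD F4,F0,F2: FP pipe
print(issues_this_cycle(ld, addd))
```

No reordering, no renaming, no issue queue: just one comparator tree on the two fetched instructions.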
Krste, March 10, 2004 (6.823, L11-5)
Function Unit Characteristics
[Figure: a fully pipelined unit built from 1-cyc stages, next to a partially pipelined unit with 2-cyc stages and busy/accept handshaking.]
Function units have internal pipeline registers
- operands are latched when an instruction enters a function unit
- inputs to a function unit (e.g., register file) can change during a long latency operation

Krste, March 10, 2004 (6.823, L11-6)
Multiple Function Units
[Figure: IF and ID feed an Issue stage that dispatches to parallel units (ALU, Mem, Fadd, Fmul, Fdiv) sharing WB; GPRs and FPRs supply operands.]
Three instructions are potentially affected by a single cycle of load delay, as FP register loads are done in the “integer” pipeline.
Superscalar: Visualizing the pipeline
Type               Pipe stages
Int. instruction   IF ID EX MEM WB
FP instruction     IF ID EX MEM WB
Int. instruction      IF ID EX MEM WB
FP instruction        IF ID EX MEM WB
Int. instruction         IF ID EX MEM WB
FP instruction           IF ID EX MEM WB
Limitations of “lockstep” superscalar

Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, may be OK.
Extending scheme to speed up general apps (Microsoft Office, ...) is complicated.
If one accepts building a complicated machine, there are better ways to do it.
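The 0.5-CPI claim is easy to check with a back-of-the-envelope model. Assume perfect scheduling and no hazards; the pairing rule is one integer plus one FP instruction per cycle, with leftovers of the majority type issuing one per cycle.

```python
def best_case_cpi(fp_fraction):
    """Best-case CPI for lockstep int+fp dual issue.
    A cycle retires a pair when both an int and an fp instruction
    are available; surplus instructions issue one per cycle."""
    pairs = min(fp_fraction, 1.0 - fp_fraction)   # dual-issue cycles per instruction
    singles = 1.0 - 2.0 * pairs                   # single-issue leftovers
    return pairs + singles                        # cycles per instruction

print(best_case_cpi(0.5))    # 50/50 mix: the ideal case
print(best_case_cpi(0.25))   # 25% FP: much of the int work issues alone
print(best_case_cpi(0.0))    # pure integer code: no speedup at all
```

Any deviation from the 50/50 mix, before counting hazards, pushes CPI back toward 1.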
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
MARCH–APRIL 2004
[Figure 3 shows four pipelines (branch, load/store, fixed-point, floating-point) fanning out after group formation and instruction decode; out-of-order processing spans MP through WB, and branch redirects and interrupt/flush paths return to instruction fetch.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure 4 distinguishes resources shared by the two threads (instruction cache and translation, branch prediction with branch history tables, return stack, and target cache, shared register mappers and files, shared issue queues, and shared execution units: FXU0/FXU1, LSU0/LSU1, FPU0/FPU1, BXU, CRL, plus the store queue, data cache/translation, and L2 cache) from per-thread resources such as the program counter and instruction buffers 0 and 1, with thread-priority logic steering dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Dynamic Scheduling: After spring break.
Digital Technical Journal, Vol. 7, No. 1, 1995
Figure 1. Five Functional Units on the Alpha 21164 Microprocessor.

[Block diagram: an instruction fetch/decode unit with an 8-KB, 32-byte-block, direct-mapped instruction cache, a 48-entry associative instruction translation buffer, an instruction stream refill buffer, instruction buffer and slotting logic, and issue scoreboard logic; an integer execution unit with two pipes (pipe 0: ADD, LOG, SHIFT, LD, ST, IMUL, CMP, CMOV, BYTE, WORD; pipe 1: ADD, LOG, LD, BR, CMP, CMOV) plus an integer multiplier; a floating-point execution unit with an add pipe plus divider and a multiply pipe; a memory address translation unit with an 8-KB, 32-byte-block, direct-mapped, dual-read-ported data cache, a 64-entry associative dual-ported translation buffer, a miss address file (6 data misses, 4 instruction stream misses), and a write buffer of six 32-byte entries; a 96-KB, 64-byte-block, 3-way set-associative second-level cache (S-cache); and a cache control and bus interface unit driving a 1-MB to 64-MB direct-mapped off-chip backup cache (B-cache). Pipeline stages run S-1 through S9.]
DEC Alpha 21164
This 4-issue chip was the high watermark for in-order superscalar designs.
RISC versus CISC: A Tale of Two Chips
Dileep Bhandarkar
Intel Corporation
Santa Clara, California, USA

Abstract
This paper compares an aggressive RISC and CISC implementation built with comparable technology. The two chips are the Alpha* 21164 and the Intel Pentium® Pro processor. The paper presents performance comparisons for industry standard benchmarks and uses performance counter statistics to compare various aspects of both designs.

Introduction
In 1991, Bhandarkar and Clark published a paper comparing an example implementation from the RISC and CISC architectural schools (a MIPS* M/2000 and a Digital VAX* 8700) on nine of the ten SPEC89 benchmarks. The organizational similarity of these machines provided an opportunity to examine the purely architectural advantages of RISC. That paper showed that the resulting advantage in cycles per program ranged from slightly under a factor of 2 to almost a factor of 4, with a geometric mean of 2.7. This paper attempts yet another comparison of a leading RISC and CISC implementation, but using chips designed with comparable semiconductor technology. The RISC chip chosen for this study is the Digital Alpha 21164 [Edmondson95]. The CISC chip is the Intel Pentium® Pro processor [Colwell95]. The results should not be used to draw sweeping conclusions about RISC and CISC in general. They should be viewed as a snapshot in time. Note that performance is also determined by the system platform and compiler used.

Chip Overview
Table 1 shows the major characteristics of the two chips. Both chips are implemented in around 0.5µ technology and the die size is comparable. The design approach is quite different, but both represent state of the art implementations that achieved the highest performance for RISC and CISC architectures respectively at the time of their introduction.

Table 1. Chip Comparison

                       Alpha 21164              Pentium® Pro Processor
Architecture           Alpha                    IA-32
Clock Speed            300 MHz                  150 MHz
Issue Rate             Four                     Three
Function Units         four                     five
Out-of-order issue     no                       yes
Rename Registers       none                     40
On-chip Cache          8 KB data, 8 KB instr,   8 KB data, 8 KB instr
                       96 KB Level 2
Off-chip cache         4 MB                     256 KB
Branch History Table   2048 entries,            512 entries,
                       2-bit history            4-bit history
Transistors            1.8 million logic,       4.5 million logic,
                       9.3 million total        5.5 million total
VLSI Process           CMOS, 0.5 µ,             BiCMOS, 0.6 µ,
                       4 metal layers           4 metal layers
Die Size               298 mm2                  306 mm2
Package                499-pin PGA              387-pin PGA
Power                  50 W                     20 W incl. cache
First Silicon          Feb. 94                  4Q 94
Volume Parts           1Q 95                    4Q 95
SPECint92/95           341/7.43                 245/6.08
SPECfp92/95            513/12.4                 220/5.42
SYSmark/NT             529                      497

The 21164 is a quad-issue superscalar design that implements two levels of caches on chip, but does not implement out-of-order execution. The Pentium® Pro processor implements dynamic execution using an out-of-order, speculative execution engine, with register renaming of integer, floating point and flags variables. Consequently, even though the die size is comparable, the total transistor count is quite different for the two chips. The aggressive design of the Pentium Pro processor is much more logic intensive; and logic transistors are less dense. The on-chip 96 KB L2 cache of the 21164 inflates its transistor count. Even though the Alpha 21164 has an on-chip L2 cache, most systems use a 2 or 4 MB board level cache to achieve their performance goal.
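A quick normalization of the Table 1 numbers (arithmetic on the published figures only; the per-MHz framing is ours, not the paper's) shows how the out-of-order design compensates for its clock-rate deficit:

```python
# SPEC95 ratios and clock rates taken from Table 1 above.
alpha = {"clock_mhz": 300, "specint95": 7.43, "specfp95": 12.4}
ppro  = {"clock_mhz": 150, "specint95": 6.08, "specfp95": 5.42}

def per_mhz(chip, key):
    # performance delivered per MHz of clock
    return chip[key] / chip["clock_mhz"]

# Pentium Pro's per-clock advantage (>1 means more work per cycle).
int_ratio = per_mhz(ppro, "specint95") / per_mhz(alpha, "specint95")
fp_ratio  = per_mhz(ppro, "specfp95") / per_mhz(alpha, "specfp95")

print(round(int_ratio, 2), round(fp_ratio, 2))
```

Per cycle, the out-of-order CISC chip does substantially more integer work, while the in-order Alpha keeps a per-clock edge on floating point.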
Alpha 21164 (12.1 SPECint95) and 200 MHz Pentium® Pro (8.71 SPECint95) processors, circa October 1996. The results show that while the Alpha system is 45% faster on SPECint_rate95, it is 8% slower on the TPC-C benchmark with a 59% higher $/tpmC using the same database software!
Table 4. TPC-C Performance

                  Compaq ProLiant 5000       Digital AlphaServer
                  Model 6/200                4100 5/400
CPUs              Four 200 MHz Pentium       Four 400 MHz Alpha
                  Pro processors             21164 processors
L2 cache          512 KB                     4 MB
SPECint_rate95    292 (est)2                 422
TPC-C perf        8311 tpmC @                7598 tpmC @
                  $95.32/tpmC                $152.04/tpmC
Operating Sys     SCO UnixWare               Digital UNIX
Database          Sybase SQL Server 11.0     Sybase SQL Server 11.0
Concluding Remarks
Studies like this one offer some insight into the performance characteristics of different instruction set architectures and attempts to implement them well in a comparable technology. The overall performance is affected by many factors and strict cause-effect relationships are hard to pinpoint. Such explorations are also hindered by the lack of measured data on common workloads for systems designed by different companies. This study would have been more meaningful if more stressful environments like on-line transaction processing and computer aided design could have been analyzed in detail. Nevertheless, it does provide new quantitative data that can be used to get a better understanding of the performance differences between a premier RISC and CISC implementation.

Using a comparable die size, the Pentium® Pro processor achieves 80 to 90% of the performance of the Alpha 21164 on integer benchmarks and transaction processing workloads. It uses an aggressive out-of-order design to overcome the instruction set level limitations of a CISC architecture. On floating-point intensive benchmarks, the Alpha 21164 does achieve over twice the performance of the Pentium Pro processor.
2 Measured result for Fujitsu ICL Superserver J654i using the same processor.
Acknowledgments
The author is grateful to Jeff Reilly and Mason Guy of Intel for collecting the performance counter measurement data for the Pentium® Pro processor, and Zarka Cvetanovic of Digital Equipment Corporation for providing the performance counter measurement data for the Alpha 21164.
References
[Bannon95] P. Bannon and J. Keller, “Internal Architecture of Alpha 21164 Microprocessor,” Proc. Compcon Spring 95, March 1995.
[Bhandarkar91] D. Bhandarkar and D. Clark, “Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization,” Proceedings of ASPLOS-IV, April 1991.
[Bhandarkar95] D. Bhandarkar, “Alpha Implementations and Architecture: Complete Reference and Guide,” Digital Press, Newton, MA, 1995, ISBN 1-55558-130-7.
[Bhandarkar97] D. Bhandarkar and J. Ding, “Performance Characterization of the Pentium Pro Processor,” Proceedings of HPCA-3, February 1997.
[Colwell95] R. Colwell and R. Steck, “A 0.6um BiCMOS Processor with Dynamic Execution,” ISSCC Proceedings, pp. 176-177, February 1995.
[Cvetanovic96] Z. Cvetanovic and D. Bhandarkar, “Performance Characterization of the Alpha 21164 Microprocessor using TP and SPEC Workloads,” Proceedings of HPCA-2, February 1996.
[Edmondson95] J. Edmondson et al., “Superscalar Instruction Execution in the 21164 Microprocessor,” IEEE Micro, April 1995, pp. 33-43.
[Papworth96] D. Papworth, “Tuning The Pentium® Pro Microarchitecture,” IEEE Micro, April 1996, pp. 8-15.
[Yeh91] Tse-Yu Yeh and Yale Patt, “Two-Level Adaptive Training Branch Prediction,” Proc. IEEE Micro-24, Nov. 1991, pp. 51-61.

* Intel® and Pentium® are registered trademarks of Intel Corporation. Other brands and names are the property of their respective owners.
Final paragraph
DEC was sold off to Compaq a few years later ... who sold off Digital Semiconductor to Intel ... who still makes Alpha chips in small batches for HP (who bought Compaq).
Break
The CDC 6600 was the world’s fastest computer for 5 years (1964-1969).
The design team was located in a small town in Wisconsin, the home town of its leader, Seymour Cray.
The lab was placed far from CDC headquarters in Minneapolis, to limit
interference from upper management.
Operator Console
Mainframe
Top view: a “+” sign
Tape Drives
Punched card reader
Top-down view: Entire main frame was liquid-cooled with Freon.
Transistor-based design, running at 100 ns clock speed.
64K of 60-bit words, implemented with magnetic core memory.
Bus wires: twisted wire pairs that were trimmed by hand to meet cycle time.
Museum collection ...
First commercial use of display consoles ... ran “space wars” vector games.
Freon cooling control panel.
Twisted pair bus wires.
Trimmed by hand.
Magnetic core memory module
Memory modules were hand-woven by former textile workers ... this is why the machine cost $7M in 1962 dollars!
Logic gate circuit modules ...
50 transistors: 2.5 x 2.5 x 0.8 inch
Peripheral processor invented multithreading
Out-of-order execution.
“Scoreboard”
10 functional units
Long, variable latency
Register File
The first RISC machine
Includes eight 60-bit floating point registers
Architecture
Instruction Fetch and the Scoreboard
The scoreboard controls the execution flow of all instructions. Its goal is to maintain a CPI of 1.
The instruction fetch unit is decoupled. Its goal is to pass one decoded instruction to the scoreboard every cycle. The scoreboard holds decoded copies of all in-flight instructions, and tracks the status of all elements cycle-by-cycle.
Lifecycle of an instruction in the scoreboard (part 1)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Newly arrived instructions are placed in this state until (1) a functional unit becomes free, and (2) no other issued instruction wants to write the register it wants to write.
If an instruction is in pending issue, the scoreboard stalls the instruction fetch unit.
Prevents WAW hazards.
Lifecycle of an instruction in the scoreboard (part 2)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Instructions remain in this state until neither of their operand registers is waiting to be written by a functional unit.
Prevents RAW hazards.
Lifecycle of an instruction in the scoreboard (part 3)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

This state can last many cycles, as functional units have long latency.
Lifecycle of an instruction in the scoreboard (part 4)

Pending Issue → Awaiting operands → Execution in progress → Execution has completed → Result is written

Instructions may pass through this state, unless there is an instruction in Pending or Awaiting mode that (1) preceded it in the instruction stream, and (2) needs to read the register this instruction plans to write.

Prevents WAR hazards.
What the scoreboard keeps score of

The full status of each functional unit:
(1) Is it running an instruction? Which one?
(2) What are its source/destination registers?
(3) For each source: waiting / ready-to-read / read.
(4) For each source: who will be writing it?

For each register, which functional unit is planning to write it?

Current state of all in-flight instructions.
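The bookkeeping above maps onto two small tables: per-unit status and a register-to-writer map. This is a minimal Python sketch of CDC-6600-style scoreboard checks (unit names, register numbering, and the exact field set are illustrative; it follows the classic busy/dest/src/pending-producer/ready-to-read bookkeeping, not the actual CDC hardware):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FU:
    busy: bool = False
    dest: Optional[int] = None
    src1: Optional[int] = None
    src2: Optional[int] = None
    q1: Optional[str] = None   # unit that will produce src1 (None = no pending writer)
    q2: Optional[str] = None
    r1: bool = False           # src1 value is ready but not yet read
    r2: bool = False

class Scoreboard:
    def __init__(self, units):
        self.fu = {u: FU() for u in units}
        self.writer = {}       # register -> name of unit that will write it

    def can_issue(self, u, dest):
        # stall fetch on a structural hazard (unit busy) or WAW hazard
        return not self.fu[u].busy and dest not in self.writer

    def issue(self, u, dest, src1, src2):
        assert self.can_issue(u, dest)
        f = self.fu[u]
        f.busy, f.dest, f.src1, f.src2 = True, dest, src1, src2
        f.q1, f.q2 = self.writer.get(src1), self.writer.get(src2)
        f.r1, f.r2 = f.q1 is None, f.q2 is None
        self.writer[dest] = u

    def can_read_operands(self, u):
        # RAW: wait until neither operand has a pending producer
        return self.fu[u].r1 and self.fu[u].r2

    def read_operands(self, u):
        self.fu[u].r1 = self.fu[u].r2 = False

    def can_write_result(self, u):
        # WAR: block while any other unit still holds the old value of our
        # destination register as an unread source (ready but not yet read)
        d = self.fu[u].dest
        return all(not (f.busy and ((f.src1 == d and f.r1) or
                                    (f.src2 == d and f.r2)))
                   for name, f in self.fu.items() if name != u)

    def write_result(self, u):
        f = self.fu[u]
        for g in self.fu.values():        # wake units waiting on this result
            if g.q1 == u:
                g.q1, g.r1 = None, True
            if g.q2 == u:
                g.q2, g.r2 = None, True
        del self.writer[f.dest]
        self.fu[u] = FU()

# RAW demo: an ADD must wait for a long-latency DIV's result.
sb = Scoreboard(["int0", "fdiv"])
sb.issue("fdiv", dest=0, src1=2, src2=4)   # DIV  F0 <- F2, F4
sb.read_operands("fdiv")
sb.issue("int0", dest=6, src1=0, src2=8)   # ADD  F6 <- F0, F8
assert not sb.can_read_operands("int0")    # stalls: F0 still pending
```

Issue stalls on structural and WAW hazards, operand read stalls on RAW hazards, and write-back stalls on WAR hazards, matching the four lifecycle slides above.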
Limitations of scoreboard control ...
If one accepts building a complicated machine, there are better ways to do it.
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
[Figure 3 shows four pipelines (branch, load/store, fixed-point, floating-point) fanning out after group formation and instruction decode; out-of-order processing spans MP through WB, and branch redirects and interrupt/flush paths return to instruction fetch.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure 4 distinguishes resources shared by the two threads (instruction cache and translation, branch prediction with branch history tables, return stack, and target cache, shared register mappers and files, shared issue queues, and shared execution units: FXU0/FXU1, LSU0/LSU1, FPU0/FPU1, BXU, CRL, plus the store queue, data cache/translation, and L2 cache) from per-thread resources such as the program counter and instruction buffers 0 and 1, with thread-priority logic steering dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Dynamic Scheduling: After spring break.
On Thursday
Midterm Review Lecture