University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell
CS 352H: Computer Systems Architecture
Topic 10: Instruction Level Parallelism (ILP)
October 6 - 8, 2009
Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in parallel
To increase ILP:
- Deeper pipeline: less work per stage ⇒ shorter clock cycle
- Multiple issue: replicate pipeline stages ⇒ multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4 GHz, 4-way multiple issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice
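The peak-rate arithmetic can be checked with a short calculation (a Python sketch; the 4 GHz, 4-way figures come from the example above):

```python
# Peak-throughput arithmetic for a multiple-issue pipeline, using the
# slide's example parameters (4 GHz clock, 4-way issue).
clock_hz = 4e9           # 4 GHz clock
issue_width = 4          # instructions issued per cycle, at best

peak_ipc = issue_width                   # peak instructions per cycle
peak_cpi = 1 / peak_ipc                  # peak cycles per instruction
peak_bips = clock_hz * peak_ipc / 1e9    # billions of instructions per second

print(peak_ipc, peak_cpi, peak_bips)     # 4 0.25 16.0
```

Real programs fall well short of these peaks because of the dependencies discussed below.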
Multiple Issue
Static multiple issue
- Compiler groups instructions to be issued together
- Packages them into “issue slots”
- Compiler detects and avoids hazards

Dynamic multiple issue
- CPU examines instruction stream and chooses instructions to issue each cycle
- Compiler can help by reordering instructions
- CPU resolves hazards using advanced techniques at runtime
Speculation
“Guess” what to do with an instruction
- Start operation as soon as possible
- Check whether guess was right
  - If so, complete the operation
  - If not, roll back and do the right thing
- Common to static and dynamic multiple issue

Examples
- Speculate on branch outcome: roll back if path taken is different
- Speculate on load: roll back if location is updated
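The guess/check/roll-back cycle can be sketched as a toy simulation (Python; the register names and the +1/−1 “paths” are invented purely for illustration):

```python
# Minimal sketch of speculation with rollback: results of speculatively
# executed work are buffered, then committed if the guess was right or
# simply discarded if it was wrong.
def run_speculative(guess_taken, actually_taken, arch_regs):
    buffer = {}                        # speculative results, not yet committed
    # Speculate: execute down the guessed path, writing only to the buffer.
    if guess_taken:
        buffer["r1"] = arch_regs["r0"] + 1
    else:
        buffer["r1"] = arch_regs["r0"] - 1
    # Check the guess against the real outcome.
    if guess_taken == actually_taken:
        arch_regs.update(buffer)       # commit buffered results
    # else: roll back by dropping the buffer; architectural state untouched
    return arch_regs

regs = run_speculative(True, True, {"r0": 5, "r1": 0})
print(regs["r1"])    # 6: guess was right, result committed
regs = run_speculative(True, False, {"r0": 5, "r1": 0})
print(regs["r1"])    # 0: wrong guess, buffered result discarded
```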
Compiler/Hardware Speculation
Compiler can reorder instructions
- e.g., move load before branch
- Can include “fix-up” instructions to recover from incorrect guess

Hardware can look ahead for instructions to execute
- Buffer results until it determines they are actually needed
- Flush buffers on incorrect speculation
Speculation and Exceptions
What if an exception occurs on a speculatively executed instruction?
- e.g., speculative load before null-pointer check

Static speculation
- Can add ISA support for deferring exceptions

Dynamic speculation
- Can buffer exceptions until instruction completion (which may not occur)
Static Multiple Issue
Compiler groups instructions into “issue packets”
- Group of instructions that can be issued on a single cycle
- Determined by pipeline resources required

Think of an issue packet as a very long instruction
- Specifies multiple concurrent operations
- ⇒ Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
Compiler must remove some/all hazards
- Reorder instructions into issue packets
- No dependencies within a packet
- Possibly some dependencies between packets
  - Varies between ISAs; compiler must know!
- Pad with nop if necessary
MIPS with Static Dual Issue
Two-issue packets
- One ALU/branch instruction
- One load/store instruction
- 64-bit aligned
  - ALU/branch first, then load/store
  - Pad an unused slot with nop
Pipeline stages by instruction type:

Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB
MIPS with Static Dual Issue
Hazards in the Dual-Issue MIPS
More instructions executing in parallel

EX data hazard
- Forwarding avoided stalls with single issue
- Now can’t use ALU result in a load/store in the same packet
    add $t0, $s0, $s1
    lw  $s2, 0($t0)
- Split into two packets, effectively a stall

Load-use hazard
- Still one cycle of use latency, but it now spans two instruction slots
- More aggressive scheduling required
Scheduling Example
Schedule this for dual-issue MIPS
Loop: lw   $t0, 0($s1)        # $t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, –4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

Scheduled:

        ALU/branch              Load/store             cycle
Loop:   nop                     lw   $t0, 0($s1)       1
        addi $s1, $s1, –4       nop                    2
        addu $t0, $t0, $s2      nop                    3
        bne  $s1, $zero, Loop   sw   $t0, 4($s1)       4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
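The IPC figure can be recomputed mechanically from the schedule (Python; nops in unused slots do not count as useful work):

```python
# The dual-issue schedule as (ALU/branch, load/store) packets, one per cycle.
schedule = [
    ("nop",  "lw"),    # cycle 1
    ("addi", "nop"),   # cycle 2
    ("addu", "nop"),   # cycle 3
    ("bne",  "sw"),    # cycle 4
]
# Count only real instructions, then divide by cycles taken.
useful = sum(op != "nop" for packet in schedule for op in packet)
ipc = useful / len(schedule)
print(useful, ipc)    # 5 1.25
```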
Loop Unrolling
Replicate loop body to expose more parallelism
- Reduces loop-control overhead

Use different registers per replication
- Called “register renaming”
- Avoids loop-carried “anti-dependencies”
  - Store followed by a load of the same register
  - Aka “name dependence”: reuse of a register name
Loop Unrolling Example
        ALU/branch              Load/store             cycle
Loop:   addi $s1, $s1, –16      lw   $t0, 0($s1)       1
        nop                     lw   $t1, 12($s1)      2
        addu $t0, $t0, $s2      lw   $t2, 8($s1)       3
        addu $t1, $t1, $s2      lw   $t3, 4($s1)       4
        addu $t2, $t2, $s2      sw   $t0, 16($s1)      5
        addu $t3, $t3, $s2      sw   $t1, 12($s1)      6
        nop                     sw   $t2, 8($s1)       7
        bne  $s1, $zero, Loop   sw   $t3, 4($s1)       8

IPC = 14/8 = 1.75
- Closer to 2, but at cost of registers and code size
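Where the 14/8 comes from, as a quick check (Python; the counts are read off the unrolled schedule):

```python
# Four unrolled iterations of three real instructions each (lw, addu, sw),
# plus loop overhead (addi + bne) paid once for all four, over 8 cycles.
unroll_factor = 4
work_per_iteration = 3    # lw, addu, sw
shared_overhead = 2       # addi + bne, shared across the unrolled body
cycles = 8

instructions = unroll_factor * work_per_iteration + shared_overhead
ipc = instructions / cycles
print(instructions, ipc)   # 14 1.75
```

Without unrolling, the overhead is paid every iteration, which is why the rolled loop managed only 1.25.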
Dynamic Multiple Issue
“Superscalar” processors
- CPU decides whether to issue 0, 1, 2, … instructions each cycle
- Avoiding structural and data hazards

Avoids the need for compiler scheduling
- Though it may still help
- Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls
- But commit results to registers in order

Example:
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
Can start sub while addu is waiting for lw
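Why `sub` can begin before `addu` can be shown with a minimal readiness check (a Python sketch; a real out-of-order core uses reservation stations, but the dependence test is the same idea):

```python
# Which instructions could begin in the first cycle? An instruction is
# startable once all its source registers are ready; every earlier
# instruction's destination counts as pending (still in flight).
def startable_first_cycle(program, ready):
    pending, startable = set(), []
    for op, dest, srcs in program:
        if all(r in ready and r not in pending for r in srcs):
            startable.append(op)
        pending.add(dest)    # dest is now owed by an in-flight instruction
    return startable

program = [
    ("lw",   "$t0", ["$s2"]),          # long-latency load
    ("addu", "$t1", ["$t0", "$t2"]),   # needs the load result
    ("sub",  "$s4", ["$s4", "$t3"]),   # independent of the load
    ("slti", "$t5", ["$s4"]),          # needs sub's new $s4
]
ready = {"$s2", "$t2", "$t3", "$s4"}   # register values valid at cycle 0
print(startable_first_cycle(program, ready))   # ['lw', 'sub']
```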
Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU. Reservation stations hold pending operands; results are also sent to any waiting reservation stations. The reorder buffer for register writes can supply operands for issued instructions and preserves dependencies.]
Register Renaming
Reservation stations and reorder buffer effectively provide register renaming

On instruction issue to a reservation station:
- If operand is available in register file or reorder buffer
  - Copied to reservation station
  - No longer required in the register; can be overwritten
- If operand is not yet available
  - It will be provided to the reservation station by a function unit
  - Register update may not be required
Speculation
Predict branch and continue issuing
- Don’t commit until branch outcome determined

Load speculation
- Avoid load and cache miss delay
  - Predict the effective address
  - Predict loaded value
  - Load before completing outstanding stores
  - Bypass stored values to load unit
- Don’t commit load until speculation cleared
Why Do Dynamic Scheduling?
Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can’t always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards
Does Multiple Issue Work?
Yes, but not as much as we’d like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
Power Efficiency
Complexity of dynamic scheduling and speculation requires power
Multiple simpler cores may be better
Microprocessor   Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486             1989    25 MHz     5               1            No                        1        5W
Pentium          1993    66 MHz     5               2            No                        1       10W
Pentium Pro      1997   200 MHz    10               3            Yes                       1       29W
P4 Willamette    2001  2000 MHz    22               3            Yes                       1       75W
P4 Prescott      2004  3600 MHz    31               3            Yes                       1      103W
Core             2006  2930 MHz    14               4            Yes                       2       75W
UltraSparc III   2003  1950 MHz    14               4            No                        1       90W
UltraSparc T1    2005  1200 MHz     6               1            No                        8       70W
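One way to read the table: on the aggressive out-of-order designs, power grew much faster than peak issue rate (a Python sketch over three rows of the table; these are peak rates assuming every slot filled, not measured performance):

```python
# Peak issue rate vs. power for three chips from the table.
chips = {
    # name: (clock_ghz, issue_width, cores, watts)
    "P4 Prescott":   (3.6,  3, 1, 103),
    "Core":          (2.93, 4, 2, 75),
    "UltraSparc T1": (1.2,  1, 8, 70),
}
for name, (ghz, width, cores, watts) in chips.items():
    peak_bips = ghz * width * cores      # best case: every slot filled
    print(f"{name}: {peak_bips:.2f} peak BIPS, {peak_bips / watts:.3f} BIPS/W")
```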
The Opteron X4 Microarchitecture
[Figure: Opteron X4 microarchitecture, with 72 physical registers]
The Opteron X4 Pipeline Flow
For integer operations
- FP is 5 stages longer
- Up to 106 RISC-ops in progress

Bottlenecks
- Complex instructions with long dependencies
- Branch mispredictions
- Memory access delays
Concluding Remarks
Multiple issue and dynamic scheduling (ILP)
- Dependencies limit achievable parallelism
- Complexity leads to the power wall