© 2004 Morgan Kaufmann Publishers
Chapter Six
Enhancing Performance with Pipelining
Outline
• 6.1 An overview of pipelining
• 6.2 A Pipelined Datapath
• 6.3 Pipelined Control
• 6.4 Data Hazards and Forwarding
• 6.5 Data Hazards and Stalls
• 6.6 Branch Hazards
• 6.7 Using a Hardware Description Language to Describe and Model a Pipeline
• 6.8 Exceptions
• 6.9 Advanced Pipelining: Extracting More Performance
• 6.10 Real Stuff: The Pentium 4 Pipeline
• 6.11 Fallacies and Pitfalls
• 6.12 Concluding Remarks
• 6.13 Historical Perspective and Further Reading
6.1 An Overview of Pipelining
Keywords
• Pipelining An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.
• Structural hazard An occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle.
• Data hazard Also called pipeline data hazard. An occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.
• Forwarding Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.
• Load-use data hazard A specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested.
Keywords
• Pipeline stall Also called bubble. A stall initiated in order to resolve a hazard.
• Control hazard Also called branch hazard. An occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.
• Untaken branch One that falls through to the successive instruction. A taken branch is one that causes transfer to the branch target.
• Branch prediction A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
• Latency (pipeline) The number of stages in a pipeline or the number of stages between two instructions during execution.
Figure 6.1 The laundry analogy for pipelining.
Figure 6.2 Total time for each instruction calculated from the time for each component.
Instruction class                  Instruction  Register  ALU        Data    Register  Total
                                   fetch        read      operation  access  write     time
Load word (lw)                     200 ps       100 ps    200 ps     200 ps  100 ps    800 ps
Store word (sw)                    200 ps       100 ps    200 ps     200 ps            700 ps
R-format (add, sub, and, or, slt)  200 ps       100 ps    200 ps             100 ps    600 ps
Branch (beq)                       200 ps       100 ps    200 ps                       500 ps
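As a sanity check, each per-class total above is just the sum of the component times that instruction class actually uses. A small sketch (stage names are mine) reproduces the last column:

```python
# Per-stage times in picoseconds, from Figure 6.2.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class uses.
USES = {
    "lw":       ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw":       ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq":      ["fetch", "reg_read", "alu"],
}

totals = {cls: sum(STAGE_PS[s] for s in stages) for cls, stages in USES.items()}
print(totals)  # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}
```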
Pipelining
• Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
[Figure: nonpipelined vs. pipelined execution of three lw instructions. Sequentially, each instruction takes 800 ps, so one completes every 800 ps; pipelined, with 200 ps stages, a new instruction completes every 200 ps once the pipeline is full. Note: the timing assumptions changed for this example.]
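The ideal-speedup question above can be checked numerically. A sketch (helper names are mine) models total time for n instructions with and without pipelining, using the 800 ps / 200 ps figures from the example:

```python
def nonpipelined_time(n, instr_time_ps=800):
    """Each instruction runs to completion before the next starts."""
    return n * instr_time_ps

def pipelined_time(n, stages=5, cycle_ps=200):
    """First instruction takes `stages` cycles; each later one adds one cycle."""
    return (stages + (n - 1)) * cycle_ps

# For 3 instructions (the example): 2400 ps vs 1400 ps.
print(nonpipelined_time(3), pipelined_time(3))

# For large n the speedup approaches instr_time / cycle_time = 4x here,
# not the 5x stage count, because the stages are unbalanced (800/200 = 4).
n = 1_000_000
print(round(nonpipelined_time(n) / pipelined_time(n), 2))
```

So the stage count is only an upper bound on speedup; unbalanced stages and pipeline fill time keep us below it.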
Figure 6.4 Graphical representation of the instruction pipeline, similar in spirit to the laundry pipeline in figure 6.1 on page 371.
Figure 6.5 Graphical representation of forwarding
Figure 6.6 We need a stall even with forwarding when an R-format instruction following a load tries to use the data.
Figure 6.7 Pipeline showing stalling on every conditional branch as a solution to control hazards.
Figure 6.8 Predicting that branches are not taken as a solution to control hazards.
Pipelining
• What makes it easy?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We'll build a simple pipeline and look at these issues
• We'll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.
6.2 A Pipelined Datapath
Basic Idea
• What do we need to add to actually split the datapath into stages?
The five stages:
  IF: Instruction fetch
  ID: Instruction decode/register file read
  EX: Execute/address calculation
  MEM: Memory access
  WB: Write back

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory) divided into these five stages.]
Figure 6.10 Instructions being executed using the single-cycle datapath in figure 6.9, assuming pipelined execution.
Pipelined Datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
[Figure 6.11: the pipelined version of the datapath, with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the five stages.]
Figure 6.12 IF and ID: first and second pipe stages of an instruction, with the active portions of the datapath in figure 6.11 highlighted.
Figure 6.13 EX: the third pipe stage of a load instruction, highlighting the portions of the datapath in figure 6.11 used in this pipe stage.
Figure 6.14 MEM and WB: the fourth and fifth pipe stages of a load instruction, highlighting the portions of the datapath in figure 6.11 used in this pipe stage.
Figure 6.15 EX: the third pipe stage of a store instruction.
Figure 6.16 MEM and WB: the fourth and fifth pipe stages of a store instruction.
Corrected Datapath
Graphically Representing Pipelines
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
[Figure: multiple-clock-cycle pipeline diagram of three lw instructions over clock cycles CC 1–CC 7, drawn with IM, Reg, ALU, and DM symbols for each stage.]
Figure 6.18 The portion of the datapath in figure 6.17 that is used in all five stages of a load instruction.
Figure 6.19 Multiple-clock-cycle pipeline diagram of five instructions.
Figure 6.20 Traditional multiple-clock-cycle pipeline diagram of five instructions in figure 6.19.
Figure 6.21 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in figures 6.19 and 6.20.
6.3 Pipelined Control
Pipeline Control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – a fancy control center telling everyone what to do?
  – should we use a finite state machine?
Figure 6.23 A copy of figure 5.12 on page 302.
Instruction   ALUOp  Instruction        Function  Desired ALU       ALU control
opcode               operation          code      action            input
LW            00     load word          XXXXXX    add               0010
SW            00     store word         XXXXXX    add               0010
Branch equal  01     branch equal       XXXXXX    subtract          0110
R-type        10     add                100000    add               0010
R-type        10     subtract           100010    subtract          0110
R-type        10     AND                100100    and               0000
R-type        10     OR                 100101    or                0001
R-type        10     set on less than   101010    set on less than  0111
Figure 6.24 A copy of figure 5.16 on page 306.
For each signal: effect when deasserted (0) / effect when asserted (1).

RegDst
  0: The register destination number for the Write register comes from the rt field (bits 20:16).
  1: The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite
  0: None.
  1: The register on the Write register input is written with the value on the Write data input.
ALUSrc
  0: The second ALU operand comes from the second register file output (Read data 2).
  1: The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc
  0: The PC is replaced by the output of the adder that computes the value of PC + 4.
  1: The PC is replaced by the output of the adder that computes the branch target.
MemRead
  0: None.
  1: Data memory contents designated by the address input are put on the Read data output.
MemWrite
  0: None.
  1: Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg
  0: The value fed to the register Write data input comes from the ALU.
  1: The value fed to the register Write data input comes from the data memory.
Figure 6.25 The values of the control lines are the same as in figure 5.18 on page 308, but they have been shuffled into three groups corresponding to the last three pipeline stages.
             Execution/address calculation    Memory access            Write-back
             stage control lines              stage control lines      stage control lines
Instruction  RegDst ALUOp1 ALUOp0 ALUSrc      Branch MemRead MemWrite  RegWrite MemtoReg
R-format     1      1      0      0           0      0       0         1        0
lw           0      0      0      1           0      1       0         1        1
sw           X      0      0      1           0      0       1         0        X
beq          X      0      1      0           1      0       0         0        X
6.4 Data Hazards and Forwarding
Pipeline Control
• Pass control signals along just like the data
[Table: the control-line values of figure 6.25, grouped into execution, memory-access, and write-back fields for R-format, lw, sw, and beq.]
[Figure: the nine control values generated during ID travel with the instruction through the ID/EX, EX/MEM, and MEM/WB pipeline registers; the EX, M, and WB fields are consumed one group per stage.]
Datapath with Control
Dependencies
• Problem with starting the next instruction before the first is finished
  – dependencies that "go backward in time" are data hazards

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

[Figure: multiple-clock-cycle diagram (CC 1–CC 9). The value of register $2 is 10 through CC 4, becomes –20 in the middle of CC 5 when sub writes back, and stays –20 afterward, so the following instructions read the stale value.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the "nops"?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

• Problem: this really slows us down!
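A sketch of what such a compiler pass would do (the register-tuple encoding is mine). Assuming no forwarding and a register file that writes in the first half of a cycle and reads in the second, a value is safe to read starting two instructions after its producer, so the pass pads with nops whenever either of the previous two issued instructions writes a source register:

```python
def insert_nops(program):
    """program: list of (dest, [sources]); dest is None for sw and nops.
    Pads with nops until no source is written by the last two issued
    instructions (no-forwarding, write-then-read register file assumed)."""
    out = []
    for dest, srcs in program:
        while any(d is not None and d in srcs for d, _ in out[-2:]):
            out.append((None, []))          # nop
        out.append((dest, srcs))
    return out

prog = [
    ("$2",  ["$1", "$3"]),   # sub $2, $1, $3
    ("$12", ["$2", "$5"]),   # and $12, $2, $5
    ("$13", ["$6", "$2"]),   # or  $13, $6, $2
    ("$14", ["$2", "$2"]),   # add $14, $2, $2
    (None,  ["$15", "$2"]),  # sw  $15, 100($2)
]
padded = insert_nops(prog)
print(sum(1 for d, s in padded if (d, s) == (None, [])))  # 2 nops, both before and
```

For this sequence two nops between sub and and suffice; after that, every later reader of $2 is far enough from sub.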
EX hazard:

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard:

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
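Both condition sets drop into one small function. This is a sketch of the combined forwarding unit (field names follow the slides; note that when the EX and MEM conditions match the same register, the EX hazard must win, because EX/MEM holds the more recent result — a refinement the simplified equations above leave out):

```python
def forward(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) mux controls as 2-bit strings.
    ex_mem / mem_wb: dicts with 'RegWrite' (bool) and 'Rd' (int)."""
    def select(src_reg):
        # EX hazard: forward the prior ALU result from EX/MEM (takes priority).
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
            return "10"
        # MEM hazard: forward from MEM/WB (data memory or an earlier ALU result).
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
            return "01"
        return "00"  # no hazard: operand comes from the register file
    return select(id_ex_rs), select(id_ex_rt)

# sub $2,$1,$3 then and $12,$2,$5 — and's rs=$2 hits the EX hazard:
print(forward({"RegWrite": True, "Rd": 2}, {"RegWrite": False, "Rd": 0}, 2, 5))
# → ('10', '00')
```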
Forwarding
• Use temporary results, don't wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

[Figure: pipeline diagram (CC 1–CC 9) with forwarding arrows. Value of register $2: 10 through CC 4, 10/–20 in CC 5, then –20. Value of EX/MEM: –20 in CC 4; value of MEM/WB: –20 in CC 5. Annotation: what if this $2 was $13?]
Figure 6.30 On the top are the ALU and pipeline registers before adding forwarding.
Forwarding
• The main idea (some details not shown)
Figure 6.31 The control values for the forwarding multiplexors in figure 6.30.
Mux control    Source   Explanation
ForwardA = 00  ID/EX    The first ALU operand comes from the register file.
ForwardA = 10  EX/MEM   The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01  MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00  ID/EX    The second ALU operand comes from the register file.
ForwardB = 10  EX/MEM   The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01  MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result.
Figure 6.32 The datapath modified to resolve hazards via forwarding.
Figure 6.33 A close-up of the datapath in figure 6.30 on page 409 shows a 2:1 multiplexor, which has been added to select the signed immediate as an ALU input.
6.5 Data Hazards and Stalls
Keywords
• nop An instruction that does no operation to change state.
Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to "stall" the load instruction

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

[Figure: pipeline diagram (CC 1–CC 9) showing the dependence of and on lw going backward in time: the loaded value is not available until the end of CC 4, but and needs it at the start of CC 4.]
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
lw  $2, 20($1)
and becomes nop
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2

[Figure: pipeline diagram (CC 1–CC 10) with a bubble: the and is held in ID for one cycle while a nop proceeds down the pipeline in its place; one cycle later, forwarding can supply the loaded value.]
Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go forward
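The detection condition itself is small (field names as in the forwarding equations): stall when the instruction in EX is a load whose destination register matches either source of the instruction sitting in ID. A sketch:

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the pipeline must stall one cycle: the instruction in EX
    is a load (MemRead asserted) whose destination (rt) is a source of
    the instruction currently in ID."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))

# lw $2, 20($1) in EX, and $4, $2, $5 in ID -> stall
print(load_use_stall(True, 2, 2, 5))   # True
# lw $2, 20($1) in EX, slt $1, $6, $7 in ID -> no stall
print(load_use_stall(True, 2, 6, 7))   # False
```

When the condition fires, the PC and IF/ID register are held and the ID/EX control fields are zeroed, which is exactly the "instruction that won't write anything" in the bullet above.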
6.6 Branch Hazards
Keywords
• Flush (instructions) To discard instructions in a pipeline, usually due to an unexpected event.
• Dynamic branch prediction Prediction of branches at runtime using runtime information.
• Branch prediction buffer Also called branch history table. A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.
• Branch delay slot The slot directly after a delayed branch instruction, which in the MIPS architecture is filled by an instruction that does not affect the branch.
• Branch target buffer A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.
Keywords
• Correlating predictor A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.
• Tournament branch predictor A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong

40 beq $1, $3, 28
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)

[Figure: pipeline diagram (CC 1–CC 9). By the time the branch outcome is known, the sequential instructions at 44, 48, and 52 have entered the pipeline; when the branch is taken they must be flushed and instruction 72 fetched.]
Figure 6.38 The ID stage of clock cycle 3 determines that a branch must be taken, so it selects 72 as the next PC address and zeros the instruction fetched for the next clock cycle.
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction

[Figure: state diagram with four states — two "Predict taken" and two "Predict not taken" — connected by Taken/Not taken edges, so two consecutive mispredictions are needed to flip the prediction.]

A 2-bit prediction scheme
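The 2-bit scheme behaves like a saturating counter. A sketch (the state encoding is mine: 0–1 predict not taken, 2–3 predict taken; saturate at the ends):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict
    taken. Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times then not taken once, repeated 3 times:
p = TwoBitPredictor(state=3)
outcomes = ([True] * 9 + [False]) * 3
correct = sum(p.predict() == t or p.update(t) for t in [])  # placeholder removed
correct = 0
for t in outcomes:
    correct += (p.predict() == t)
    p.update(t)
print(correct, "of", len(outcomes))  # 27 of 30: only the loop exits mispredict
```

With a 1-bit predictor the same pattern would mispredict twice per loop (the exit and the re-entry); the second bit is what avoids the re-entry miss.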
Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
Figure 6.40 Scheduling the branch delay slot.
6.7 Using a Hardware Description Language to Describe and Model a Pipeline
Figure 6.41 The final datapath and control for this chapter.
6.8 Exceptions
Keywords
• Imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.
• Precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.
Figure 6.42 The datapath with controls to handle exceptions.
Figure 6.43 The result of an exception due to arithmetic overflow in the add instruction.
6.9 Advanced Pipelining: Extracting More Performance
Keywords
• Instruction-level parallelism The parallelism among instructions.
• Multiple issue A scheme whereby multiple instructions are launched in 1 clock cycle.
• Static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.
• Dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.
• Issue slots The positions from which instructions could issue in a given clock cycle; by analogy these correspond to positions at the starting blocks for a sprint.
• Speculation An approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.
Keywords
• Issue packet The set of instructions that issues together in 1 clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.
• Loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.
• Register renaming The renaming of registers, by the compiler or hardware, to remove antidependences.
• Antidependence Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.
• Instruction group In IA-64, a sequence of consecutive instructions with no register data dependences among them.
Keywords
• Stop In IA-64, an explicit indicator of a break between independent and dependent instructions.
• Predication A technique to make instructions dependent on predicates rather than on branches.
• Poison A result generated when a speculative load yields an exception, or an instruction uses a poisoned operand.
• Advanced load In IA-64, a speculative load instruction with support to check for aliases that could invalidate the load.
• Superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
• Dynamic pipeline scheduling Hardware support for recording the order of instruction execution so as to avoid stalls.
• Commit unit The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.
Keywords
• Reservation station A buffer within a functional unit that holds the operands and the operation.
• Reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.
• In-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.
• Out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.
Figure 6.44 Static two-issue pipeline in operation.
Instruction type           Pipe stages
ALU or branch instruction  IF ID EX MEM WB
Load or store instruction  IF ID EX MEM WB
ALU or branch instruction     IF ID EX MEM WB
Load or store instruction     IF ID EX MEM WB
ALU or branch instruction        IF ID EX MEM WB
Load or store instruction        IF ID EX MEM WB
ALU or branch instruction           IF ID EX MEM WB
Load or store instruction           IF ID EX MEM WB
Figure 6.45 A static two-issue datapath.
Figure 6.46 The scheduled code as it would look on a two-issue MIPS pipeline.
       ALU or branch instruction   Data transfer instruction  Clock cycle
Loop:                              lw   $t0, 0($s1)           1
       addi $s1, $s1, -4                                      2
       addu $t0, $t0, $s2                                     3
       bne  $s1, $zero, Loop       sw   $t0, 4($s1)           4
Figure 6.47 The unrolled and scheduled code of figure 6.46 as it would look on a static two-issue pipeline.

       ALU or branch instruction   Data transfer instruction  Clock cycle
Loop:  addi $s1, $s1, -16          lw   $t0, 0($s1)           1
                                   lw   $t1, 12($s1)          2
       addu $t0, $t0, $s2          lw   $t2, 8($s1)           3
       addu $t1, $t1, $s2          lw   $t3, 4($s1)           4
       addu $t2, $t2, $s2          sw   $t0, 16($s1)          5
       addu $t3, $t3, $s2          sw   $t1, 12($s1)          6
                                   sw   $t2, 8($s1)           7
       bne  $s1, $zero, Loop       sw   $t3, 4($s1)           8
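The transformation being scheduled here — add $s2 to every array element, four iterations' worth of work per trip — can be sketched at a high level (function and variable names are mine):

```python
def add_scalar(a, s2):
    """Rolled loop: one element per iteration."""
    for i in range(len(a)):
        a[i] += s2
    return a

def add_scalar_unrolled(a, s2):
    """Unrolled by 4, mirroring figure 6.47: four independent
    load/add/store chains per iteration, giving the scheduler enough
    independent work to pair ALU ops with memory ops each cycle."""
    assert len(a) % 4 == 0
    for i in range(0, len(a), 4):
        t0, t1, t2, t3 = a[i], a[i+1], a[i+2], a[i+3]   # four loads
        t0 += s2; t1 += s2; t2 += s2; t3 += s2          # four adds
        a[i], a[i+1], a[i+2], a[i+3] = t0, t1, t2, t3   # four stores
    return a

print(add_scalar([1, 2, 3, 4], 10) == add_scalar_unrolled([1, 2, 3, 4], 10))  # True
```

Note how the four temporaries t0–t3 play the role of the renamed registers $t0–$t3: without distinct names, the iterations would be serialized by antidependences.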
Figure 6.48 A summary of the characteristics of the Itanium and Itanium 2, Intel's first two implementations of the IA-64 architecture.

Processor  Max. instr.    Functional units                           Max. ops.  Max. clock  Transistors  Power    SPEC     SPEC
           issues/clock                                              per clock  rate        (millions)   (watts)  int2000  fp2000
Itanium    6               4 integer/media, 2 memory, 3 branch, 2 FP  9          0.8 GHz     25           130      379      701
Itanium 2  6               6 integer/media, 4 memory, 3 branch, 2 FP  11         1.5 GHz     221          130      810      1427
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Dynamic Pipeline Scheduling
– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)
• Trying to exploit instruction-level parallelism
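In the word-swap sequence above, sw $t2, 0($t1) immediately follows the load of $t2, which can cost a stall depending on the forwarding paths available. Swapping the two stores separates the load from its use without changing the result. A sketch (the tiny interpreter and its encoding are mine) checks that both orders leave memory in the same state:

```python
def run(prog, mem):
    """Tiny interpreter: ('lw', reg, off) loads mem[off] into reg;
    ('sw', reg, off) stores reg into mem[off]. Returns final memory."""
    regs = {}
    for op, reg, off in prog:
        if op == "lw":
            regs[reg] = mem[off]
        else:
            mem[off] = regs[reg]
    return mem

original  = [("lw", "$t0", 0), ("lw", "$t2", 4), ("sw", "$t2", 0), ("sw", "$t0", 4)]
reordered = [("lw", "$t0", 0), ("lw", "$t2", 4), ("sw", "$t0", 4), ("sw", "$t2", 0)]

print(run(original, {0: 7, 4: 9}), run(reordered, {0: 7, 4: 9}))
# both swap the two words: {0: 9, 4: 7} {0: 9, 4: 7}
```

This is exactly the kind of legality check — same architectural state, different schedule — that either the compiler or dynamic scheduling hardware must perform before reordering.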
Figure 6.49 The three primary units of a dynamically scheduled pipeline.
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!
6.10 Real Stuff: The Pentium 4 Pipeline
Keywords
• Microarchitecture The organization of the processor, including the major functional units, their interconnection, and control.
• Architectural registers The instruction set visible registers of a processor; for example, in MIPS, these are the 32 integer and 32 floating-point registers.
Figure 6.50 The microarchitecture of the Intel Pentium 4.
Figure 6.51 The Pentium 4 pipeline showing the pipeline flow for a typical instruction and the number of clock cycles for the major steps in the pipeline.
6.11 Fallacies and Pitfalls
• Fallacy: Pipelining is easy.
• Fallacy: Pipelining ideas can be implemented independent of technology.
• Pitfall: Failure to consider instruction set design can adversely impact pipelining.
6.12 Concluding Remarks
Keywords
• Instruction latency The inherent execution time for an instruction.
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: processor organizations ordered from slower to faster by instructions per clock (IPC = 1/CPI): single-cycle (Section 5.4) and multicycle (Section 5.5) at the slow end, then pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10).]
[Figure: the same organizations — single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10) — arranged by use latency in instructions, on a scale from 1 to several.]