1
Lecture: Pipelining Basics
• Topics: Basic pipelining implementation
o What is pipelining?
o Clocks and latches
o An example 5-stage pipeline
o Loads/Stores and RISC/CISC: Hazards
o Examples of Hazards
2
Building a Car
Start and finish a job before moving to the next
[Figure: unpipelined schedule; axes: Time vs. Jobs]
If each car takes 24 hrs to build, throughput will be 1 car / 24 hrs
3
The Assembly Line
[Figure: pipelined schedule; axes: Time vs. Jobs; each job passes through stages A, B, C, and successive jobs overlap]
Break the job into smaller stages
The 24-hour task is now broken into 3 stages of 8 hours each. By overlapping the execution of successive jobs, we get a throughput of 1 car / 8 hrs.
4
Clocks and Latches
[Figure: Stage 1 and Stage 2 separated by latches (L) driven by the clock (Clk)]
Why is a latch needed between two stages, that is, between two consecutive clock cycles?
5
Some Equations
• Unpipelined: time to execute one instruction = T + Tovh
• For an N-stage pipeline, time per stage = T/N + Tovh
• Total time per instruction = N (T/N + Tovh) = T + N Tovh
• Clock cycle time = T/N + Tovh
• Clock speed = 1 / (T/N + Tovh)
• Ideal speedup = (T + Tovh) / (T/N + Tovh)
• Cycles to complete one instruction = N
• Average CPI (cycles per instr) = 1
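The equations above can be evaluated directly. The sketch below is only an illustration: the function name `pipeline_metrics` and the sample values (T = 10 ns, Tovh = 0.5 ns, N = 5) are assumptions, not numbers from the lecture, and it assumes the work splits into N equal stages.

```python
def pipeline_metrics(t_logic_ns, t_ovh_ns, n_stages):
    """Evaluate the slide's formulas for an ideal pipeline with N equal stages."""
    cycle_ns = t_logic_ns / n_stages + t_ovh_ns          # T/N + Tovh
    return {
        "cycle_ns": cycle_ns,
        "clock_ghz": 1.0 / cycle_ns,                      # 1 / (T/N + Tovh)
        "latency_ns": n_stages * cycle_ns,                # T + N * Tovh
        "ideal_speedup": (t_logic_ns + t_ovh_ns) / cycle_ns,
    }

# Hypothetical values chosen only for illustration.
m = pipeline_metrics(t_logic_ns=10.0, t_ovh_ns=0.5, n_stages=5)
for key, value in m.items():
    print(f"{key}: {value:.2f}")
```

Note that the latency of a single instruction grows with N (each extra stage adds one Tovh), while the ideal speedup comes entirely from the shorter cycle time.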
6
Problem
• An unpipelined processor takes 5 ns to work on one instruction. It then takes 0.2 ns to latch its results into latches. I was able to convert the circuits into 5 sequential pipeline stages. The stages have the following lengths: 1ns; 0.6ns; 1.2ns; 1.4ns; 0.8ns. Answer the following, assuming that there are no stalls in the pipeline.
o What is the cycle time in the new processor? 1.6ns
o What is the clock speed? 625 MHz
o What is the IPC? 1
o How long does it take to finish one instr? 8ns
o What is the speedup from pipelining? 625/192.3 = 3.25 (equivalently 5.2/1.6)
o What is the max speedup from pipelining? 5.2/0.2 = 26
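The arithmetic behind these answers can be checked with a short script. This is a sketch using the numbers from the problem statement; the variable names are illustrative. The key point is that the cycle time is set by the slowest stage plus the latch overhead, not by the average stage.

```python
stage_ns = [1.0, 0.6, 1.2, 1.4, 0.8]   # stage lengths from the problem
latch_ns = 0.2                         # latch overhead from the problem

unpipelined_ns = sum(stage_ns) + latch_ns    # 5 ns of logic + one latch
cycle_ns = max(stage_ns) + latch_ns          # slowest stage + latch
clock_mhz = 1000.0 / cycle_ns                # ns -> MHz
latency_ns = len(stage_ns) * cycle_ns        # one instr crosses all 5 stages
speedup = unpipelined_ns / cycle_ns          # throughput ratio, no stalls
max_speedup = unpipelined_ns / latch_ns      # limit as stage logic shrinks to 0

print(f"cycle time  : {cycle_ns:.1f} ns")
print(f"clock speed : {clock_mhz:.0f} MHz")
print(f"latency     : {latency_ns:.1f} ns")
print(f"speedup     : {speedup:.2f}")
print(f"max speedup : {max_speedup:.0f}")
```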
7
A 5-Stage Pipeline
[Figure: 5-stage pipeline datapath; the first latch holds the PC]
8
For an R-type (register) instruction:
The first stage is the instruction memory stage. A latch serves as the input to this stage, and this latch stores the program counter (PC).
The PC tells you where in the program you are currently executing, and it serves as the input to the instruction memory stage. With this input PC, you go to that location in memory and fetch the instruction, either from an instruction cache or from the actual system memory itself. Once the instruction is fetched, it is provided as an input to the next latch.
At the next clock edge, that instruction gets stored in latch 2. It then serves as an input to the next stage. Also in this stage, the value of the PC is sent to an adder that computes PC plus 4, and that value is fed back as the input to the PC latch.
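The fetch step described above can be sketched in a few lines. This is a toy model, not the lecture's datapath: the function name `fetch`, the dictionary used as instruction memory, and the sample addresses are all assumptions made for illustration. It assumes 4-byte instructions, matching the PC plus 4 adder.

```python
def fetch(pc, instr_mem):
    """One instruction-fetch step: read the instruction at pc, compute the next pc."""
    instr = instr_mem[pc]   # value handed to the latch feeding the next stage
    next_pc = pc + 4        # PC + 4 adder; fed back into the PC latch
    return instr, next_pc

# Hypothetical instruction memory, keyed by byte address.
instr_mem = {0: "ADD R1, R2, R3", 4: "LD 8[R3] R6", 8: "ST 8[R3] R6"}

pc = 0
fetched = []
for _ in range(3):          # three clock edges
    instr, pc = fetch(pc, instr_mem)
    fetched.append(instr)
print(fetched, pc)
```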
Task of different stages
9
10
11
Problem 3
• For the following code sequence, show how the instrs flow through the pipeline:
ADD R1, R2, R3
BEZ R4, [R5]
LD [R6] R7
ST [R8] R9
12
Pipeline Summary
Instruction       RR                           ALU      DM          RW
ADD R1, R2, R3    Rd R1,R2                     R1+R2    --          Wr R3
BEZ R1, [R5]      Rd R1,R5; Compare, Set PC    --       --          --
LD 8[R3] R6       Rd R3                        R3+8     Get data    Wr R6
ST 8[R3] R6       Rd R3,R6                     R3+8     Wr data     --
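How a sequence of instructions flows through the stages can be shown as a cycle-by-cycle diagram. The sketch below makes two simplifying assumptions that are not stated in the slides: there are no stalls, and every instruction occupies all five stages even when it does no work in one (the "--" entries in the summary). The stage name IF for the first stage and the function name `pipeline_diagram` are also assumptions.

```python
STAGES = ["IF", "RR", "ALU", "DM", "RW"]

def pipeline_diagram(instrs):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c, or ''."""
    n_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instrs):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i enters the pipeline in cycle i
        rows.append((name, row))
    return rows

program = ["ADD R1, R2, R3", "BEZ R4, [R5]", "LD [R6] R7", "ST [R8] R9"]
rows = pipeline_diagram(program)
for name, row in rows:
    print(f"{name:<16}" + "".join(f"{cell:<5}" for cell in row))
```

With four instructions and five stages, the last instruction leaves the pipeline after 4 + 5 - 1 = 8 cycles, compared with 20 cycles if each instruction ran to completion before the next started.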
13
Pipelining Hazards
14
Hazards
• Structural hazards: different instructions in different stages (or the same stage) conflicting for the same resource
• Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction
• Control hazard: fetch cannot continue because it does not know the outcome of an earlier branch – special case of a data hazard – separate category because they are treated in different ways
15
Structural Hazards
• Example: with a unified instruction and data cache, stage 4 (MEM) and stage 1 (IF) can never coincide
• The later instruction and all its successors are delayed until a cycle is found when the resource is free; these delays are pipeline bubbles
• Structural hazards are easy to eliminate – increase the number of resources (for example, implement a separate instruction and data cache)
16
Control Hazards
• Simple techniques to handle control hazard stalls:
o for every branch, introduce a stall cycle (note: every 6th instruction is a branch on average!)
o assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instructions
o predict the next PC and fetch that instr – if the prediction is wrong, cancel the effect of the wrong-path instructions
o fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
17
Branch Delay Slots
18
Problem 1
• Consider a branch that is taken 80% of the time. On average, how many stalls are introduced for this branch for each approach below:
o Stall fetch until branch outcome is known
o Assume not-taken and squash if the branch is taken
o Assume a branch delay slot
    o You can’t find anything to put in the delay slot
    o An instr before the branch is put in the delay slot
    o An instr from the taken side is put in the delay slot
    o An instr from the not-taken side is put in the slot
19
Problem 1
• Consider a branch that is taken 80% of the time. On average, how many stalls are introduced for this branch for each approach below:
o Stall fetch until branch outcome is known – 1
o Assume not-taken and squash if the branch is taken – 0.8
o Assume a branch delay slot
    o You can’t find anything to put in the delay slot – 1
    o An instr before the branch is put in the delay slot – 0
    o An instr from the taken side is put in the slot – 0.2
    o An instr from the not-taken side is put in the slot – 0.8
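Each answer is an expected value: the one-cycle penalty multiplied by the probability that the fetched or slotted instruction turns out to be useless. A minimal sketch of that reasoning, with an illustrative function name and dictionary keys, assuming a one-cycle branch penalty as in the answers above:

```python
def expected_stalls(p_taken, penalty=1):
    """Average stall cycles per branch for each approach (one-cycle penalty assumed)."""
    p_not_taken = 1 - p_taken
    return {
        "stall until outcome known": penalty,                       # always wait
        "assume not-taken, squash if taken": p_taken * penalty,     # wrong when taken
        "delay slot: nothing to put in (NOP)": penalty,             # slot always wasted
        "delay slot: instr from before branch": 0,                  # always useful
        "delay slot: instr from taken side": p_not_taken * penalty, # wasted if not taken
        "delay slot: instr from not-taken side": p_taken * penalty, # wasted if taken
    }

stalls = expected_stalls(p_taken=0.8)
for approach, value in stalls.items():
    print(f"{approach:<40} {value:.1f}")
```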
20
Multicycle Instructions
21
Effects of Multicycle Instructions
• Potentially multiple writes to the register file in a cycle
• Frequent RAW hazards
• WAW hazards (WAR hazards not possible)
• Imprecise exceptions because of out-of-order (o-o-o) instr completion
Note: Can also increase the “width” of the processor: handle multiple instructions at the same time: for example, fetch two instructions, read registers for both, execute both, etc.
22
Say 3 instructions are executing in sequence. In the IF and ID stages they are in order, but because their execution stages have different lengths, they complete out of order.
23
Instruction    Register modified    Starting cycle    End cycle
1. Mult        r1                   1                 10
2. Add         r3                   2                 5
3. Load        r9                   3                 7
So Add finishes first, then Load, and then Mult. If an exception occurs in the 10th cycle because of a multiplication overflow, the processor has to resume execution from the Mult instruction, so the Add and Load instructions need to be executed again. We therefore have to make sure that the saved register file does not include the effects of the Add and Load instructions, which come after the excepting instruction in program order.
So a processor that allows the register file to be modified only in program order ends up providing precise exceptions. To ensure in-order modification of the register file, a data structure called the reorder buffer is placed before the register write stage.
24
[Figure: reorder buffer holding, oldest first: mult r1, add r3, load r9]
When ADD completes its execution first, it checks the buffer before writing to the register file. It sees that ADD is not the oldest instruction, so it has to wait for the multiply: the multiply must be allowed to modify R1 before ADD can modify R3.
As instructions arrive and pass through the IF and ID stages in order, they each create an entry for themselves in the reorder buffer.
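The in-order commit behaviour can be sketched with a queue. This is a toy model, not a description of any real processor: the function name `commit_order` is illustrative, and it assumes that at most one instruction commits per cycle and that an instruction may commit in the same cycle in which it finishes. The finish cycles are taken from the example above.

```python
from collections import deque

def commit_order(instrs):
    """instrs: (name, finish_cycle) pairs in program order.
    Returns (name, commit_cycle) pairs in the order results reach the register file."""
    rob = deque(instrs)     # entries are allocated in program order (IF/ID are in order)
    commits = []
    cycle = 0
    while rob:
        name, finish = rob[0]               # only the oldest entry may commit
        cycle = max(cycle + 1, finish)      # wait until it finishes; one commit per cycle
        commits.append((name, cycle))
        rob.popleft()
    return commits

result = commit_order([("mult r1", 10), ("add r3", 5), ("load r9", 7)])
print(result)   # [('mult r1', 10), ('add r3', 11), ('load r9', 12)]
```

Although add and load finish in cycles 5 and 7, neither touches the register file before mult commits, so an overflow detected in cycle 10 leaves r3 and r9 unmodified.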
25
Precise Exceptions
• On an exception:
o must save PC of instruction where program must resume
o all instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own)
o temporary program state not in memory (in other words, registers) has to be stored in memory
o potential problems if a later instruction has already modified memory or registers
• A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and of course, correctness)
26
Dealing with these Effects
• Multiple writes to the register file: increase the number of ports, stall one of the writers during ID, stall one of the writers during WB (the stall will propagate)
• WAW hazards: detect the hazard during ID and stall the later instruction
• Imprecise exceptions: buffer the results if they complete early or save more pipeline state so that you can return to exactly the same state that you left at
27
Slowdowns from Stalls
• Perfect pipelining with no hazards: an instruction completes every cycle (total cycles ~ num instructions), so speedup = increase in clock speed = num pipeline stages
• With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
• Total cycles = number of instructions + stall cycles
• Slowdown because of stalls = 1/ (1 + stall cycles per instr)
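The two stall formulas can be applied as follows. The function names and the sample workload (1000 instructions, 250 stall cycles) are hypothetical, chosen only to illustrate the arithmetic.

```python
def total_cycles(num_instrs, stall_cycles):
    """Total cycles = number of instructions + stall cycles."""
    return num_instrs + stall_cycles

def slowdown_factor(stalls_per_instr):
    """Fraction of ideal throughput retained: 1 / (1 + stall cycles per instr)."""
    return 1.0 / (1.0 + stalls_per_instr)

# Hypothetical workload for illustration.
num_instrs, stall_cycles = 1000, 250
cycles = total_cycles(num_instrs, stall_cycles)
factor = slowdown_factor(stall_cycles / num_instrs)
print(cycles, factor)   # 1250 0.8
```

Here 0.25 stall cycles per instruction leave the pipeline at 80% of its ideal throughput.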
28
Pipelining Limits
[Figure: pipeline timing diagrams with stages A B C (3-stage) and A B C D E F (6-stage)]
Assume that there is a dependence where the final result of the first instruction is required before starting the second instruction
Unpipelined:
Gap between indep instrs: T + Tovh
Gap between dep instrs: T + Tovh
3-stage pipeline:
Gap between indep instrs: T/3 + Tovh
Gap between dep instrs: T + 3Tovh
6-stage pipeline:
Gap between indep instrs: T/6 + Tovh
Gap between dep instrs: T + 6Tovh
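The pattern in these gaps generalizes to any number of stages N: independent instructions are one cycle apart, while a dependent instruction must wait for all N stages of its producer. A small sketch, with an illustrative function name and hypothetical values T = 6 and Tovh = 0.5:

```python
def gaps(t, t_ovh, n_stages):
    """Return (gap between independent instrs, gap between dependent instrs)."""
    indep = t / n_stages + t_ovh    # one cycle: T/N + Tovh
    dep = n_stages * indep          # full pipeline traversal: T + N * Tovh
    return indep, dep

# Hypothetical T and Tovh, same time unit for both.
for n in (1, 3, 6):
    indep, dep = gaps(6.0, 0.5, n)
    print(f"N={n}: indep gap = {indep}, dep gap = {dep}")
```

Deeper pipelines shrink the gap between independent instructions but lengthen the gap between dependent ones, which is the limit the slide illustrates.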
29
Problem 2
• Assume an unpipelined processor where it takes 5ns to go through the circuits and 0.1ns for the latch overhead. What is the throughput for 1-stage, 20-stage and 50-stage pipelines? Assume that the point of production (P.O.P) and point of consumption (P.O.C) in the unpipelined processor are separated by 2ns. Assume that half the instructions do not introduce a data hazard and half the instructions depend on their preceding instruction.
30
Problem 2
• Assume an unpipelined processor where it takes 5ns to go through the circuits and 0.1ns for the latch overhead. What is the throughput for 1-stage, 20-stage and 50-stage pipelines? Assume that the P.O.P and P.O.C in the unpipelined processor are separated by 2ns. Assume that half the instructions do not introduce a data hazard and half the instructions depend on their preceding instruction.
• 1-stage: 1 instr every 5.1ns
• 20-stage: first instr takes 0.35ns, the second takes 2.8ns
• 50-stage: first instr takes 0.2ns, the second takes 4ns
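These numbers follow from counting how many stages lie between the P.O.P and the P.O.C. The sketch below is one reading of the problem, not the lecturer's own derivation: it assumes equal-length stages, rounds the 2ns separation up to a whole number of stages, and averages the independent and dependent gaps with equal weight to estimate throughput. The function name `problem2` is illustrative.

```python
import math

def problem2(n_stages, t_logic=5.0, t_ovh=0.1, pop_to_poc=2.0):
    """Return (indep gap, dep gap, avg ns per instr) for an N-stage pipeline."""
    cycle = t_logic / n_stages + t_ovh
    # Number of stages spanned by the 2ns between production and consumption.
    stages_between = max(1, math.ceil(pop_to_poc * n_stages / t_logic))
    indep_gap = cycle
    dep_gap = stages_between * cycle
    avg_ns = 0.5 * indep_gap + 0.5 * dep_gap   # half independent, half dependent
    return indep_gap, dep_gap, avg_ns

for n in (1, 20, 50):
    indep_gap, dep_gap, avg_ns = problem2(n)
    print(f"{n:>2}-stage: indep {indep_gap:.2f}ns, dep {dep_gap:.2f}ns, "
          f"avg {avg_ns:.3f}ns/instr, throughput {1 / avg_ns:.3f} instr/ns")
```

Under these assumptions the 20-stage design averages about 1.575ns per instruction and the 50-stage design about 2.1ns, so the deeper pipeline is slower on this instruction mix despite its faster clock.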
31
Thank you