Lec 9: Pipelining
Kavita Bala
CS 3410, Fall 2008
Computer Science
Cornell University
© Kavita Bala, Computer Science, Cornell University
Basic Pipelining
Five-stage “RISC” load-store architecture:
1. Instruction Fetch (IF)
• get instruction from memory
2. Instruction Decode (ID)
• translate opcode into control signals and read registers
3. Execute (EX)
• perform ALU operation
4. Memory (MEM)
• access memory if load/store
5. Writeback (WB)
• update register file
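To make the division of labor concrete, here is a toy interpreter (a hand-rolled sketch, not the lecture's datapath; the tuple encoding is invented) that performs each stage's job for the add/nand/lw/sw instructions used in this lecture:

```python
# Toy interpreter mirroring the five stages for this simple load-store ISA.
# Instructions are (op, a, b, c) tuples: add a b c, nand a b c, lw a off(c),
# sw a off(c) -- an invented encoding for illustration.
def run(program, regs, mem):
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]               # 1. IF: fetch
        if op == "add":                         # 2-3. ID + EX: read regs, ALU op
            result = regs[b] + regs[c]
        elif op == "nand":
            result = ~(regs[b] & regs[c])
        elif op in ("lw", "sw"):
            addr = regs[c] + b                  # EX computes base + offset
        if op == "lw":                          # 4. MEM: access memory
            result = mem[addr]
        elif op == "sw":
            mem[addr] = regs[a]
        if op != "sw":                          # 5. WB: update register file
            regs[a] = result
        pc += 1

regs = [0, 7, 10, 0, 0, 0, 0, 5]
mem = {30: 42}
run([("add", 3, 1, 2), ("lw", 4, 20, 2), ("sw", 7, 12, 3)], regs, mem)
print(regs[3], regs[4], mem[29])  # 17 42 5
```

A pipelined implementation overlaps these steps across instructions; the rest of the lecture is about what goes wrong when it does.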
Following slides thanks to Sally McKee
Sample Code (Simple)
• Assume an eight-register machine
• Run the following code on a pipelined datapath:
add 3 1 2   ; reg 3 = reg 1 + reg 2
nand 6 4 5  ; reg 6 = ~(reg 4 & reg 5)
lw 4 20(2)  ; reg 4 = Mem[reg 2 + 20]
add 5 2 5   ; reg 5 = reg 2 + reg 5
sw 7 12(3)  ; Mem[reg 3 + 12] = reg 7
[Figure: the five-stage pipelined datapath: PC, instruction memory, register file (R0-R7), ALU, and data memory, separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, which carry op, dest, offset, valA, valB, PC+1, branch target, ALU result, and memory data (mdata) between stages; instruction bits 21-23, 15-17, and 0-2 select the register operands regA, regB, and dest]
Time Graphs
Time:  1   2   3   4   5   6   7   8   9
add    IF  ID  EX  MEM WB
nand       IF  ID  EX  MEM WB
lw             IF  ID  EX  MEM WB
add                IF  ID  EX  MEM WB
sw                     IF  ID  EX  MEM WB
(IF = fetch, ID = decode, EX = execute, MEM = memory, WB = writeback)
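The arithmetic behind this graph can be sketched in a few lines (a hand-rolled model, not from the slides): with no hazards, instruction i occupies stage i + s during cycle i + s + 1, so n instructions finish in n + 4 cycles.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr_index, cycle):
    """Stage a given instruction occupies in a 1-based cycle (no hazards)."""
    s = cycle - 1 - instr_index          # instruction 0 fetches in cycle 1
    return STAGES[s] if 0 <= s < len(STAGES) else None

def total_cycles(n):
    return n + len(STAGES) - 1           # 4 fill cycles + one per instruction

print(stage_of(2, 5))    # EX: lw (third instruction) executes in cycle 5
print(total_cycles(5))   # 9: the five instructions above finish in cycle 9
```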
Pipelining Recap
• Powerful technique for masking latencies
– Logically, instructions execute one at a time
– Physically, instructions execute in parallel: instruction-level parallelism
• Decouples the processor model from the implementation
– Interface vs. implementation
• BUT dependencies between instructions complicate the implementation
What Can Go Wrong?
• Data hazards
– register reads occur in stage 2
– register writes occur in stage 5
– could read the wrong value if it is about to be written
• Control hazards
– a branch instruction may change the PC in stage 4
– what do we fetch before that?
Handling Data Hazards
• Detect and Stall
– If hazards exist, stall the processor until they go away
– Safe, but not great for performance
• Detect and Forward
– If hazards exist, fix up the pipeline to get the correct value (if possible)
– Most common solution for high performance
Handling Data Hazards III
• Detect
– dest of early instrs == source of current
• Forward
– New bypass datapaths route computed data to where it is needed
– New MUX and control to pick the right data
• Beware: stalling may still be required even in the presence of forwarding
Different Types of Hazards
• Use register value in subsequent instruction
add 3 1 2
or 6 3 1
nand 5 3 4
sub 6 3 1
Data Hazards: AL ops
add 3 1 2
nand 5 3 4

Time:  1   2   3   4   5   6
add    IF  ID  EX  MEM WB
nand       IF  ID  EX  MEM WB

If not careful, you read the wrong value of R3
Handling Data Hazards: Forwarding
• No point forwarding to decode
• Forward to EX stage
• From output of ALU and MEM stages
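A sketch of the resulting bypass mux at each ALU input (signal and function names are invented for illustration, not the slide's control logic): prefer the EX/Mem result, then the Mem/WB result, and fall back to the value read from the register file in decode.

```python
def bypass(src_reg, regfile_val, exmem, memwb):
    """exmem, memwb: (dest_reg, value) pairs; dest_reg is None if that
    in-flight instruction writes no register."""
    if exmem[0] == src_reg:
        return exmem[1]        # forward from the ALU output (EX/Mem)
    if memwb[0] == src_reg:
        return memwb[1]        # forward from the memory stage (Mem/WB)
    return regfile_val         # no in-flight writer: decoded value is fine

# nand 5 3 4 right after add 3 1 2: take add's ALU result for R3,
# not the stale register-file value.
print(bypass(3, 0, (3, 17), (None, None)))  # 17
```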
Data Hazards: AL ops

add 3 1 2
nand 5 3 4

Time:  1   2   3   4   5   6
add    IF  ID  EX  MEM WB
nand       IF  ID  EX  MEM WB

With forwarding, add's ALU result is bypassed straight into nand's EX stage, so nand uses the correct R3.
Data Hazards: AL ops

add 3 1 2
and 8 0 2
nand 5 3 4

Time:  1   2   3   4   5   6   7
add    IF  ID  EX  MEM WB
and        IF  ID  EX  MEM WB
nand           IF  ID  EX  MEM WB

Now nand needs R3 two instructions after add produced it, so the value is forwarded from the MEM stage.
Forwarding Illustration
[Figure: per-instruction pipeline diagrams (IM, Reg, ALU, DM, Reg for each instruction, offset by one cycle), with the ALU output of add $3 routed back to the ALU input of nand $5 $3]
Data Hazards: AL ops

add 3 1 2
and 8 0 2
and 8 2 0
nand 5 3 4

Time:  1   2   3   4   5   6   7   8
add    IF  ID  EX  MEM WB
and        IF  ID  EX  MEM WB
and            IF  ID  EX  MEM WB
nand               IF  ID  EX  MEM WB

Write in the first half of the cycle, read in the second half of the cycle: nand decodes in cycle 5, the same cycle add writes back, so it reads the new R3 with no forwarding needed.
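That write-first, read-second convention can be sketched as follows (a hand-rolled model; the class and method names are invented):

```python
class RegFile:
    """Register file that writes in the first half of a cycle and reads in
    the second half, so a same-cycle WB and ID see the new value."""
    def __init__(self, n=8):
        self.regs = [0] * n

    def tick(self, wb_dest, wb_val, id_regA, id_regB):
        if wb_dest is not None:
            self.regs[wb_dest] = wb_val                # first half: writeback
        return self.regs[id_regA], self.regs[id_regB]  # second half: decode reads

rf = RegFile()
valA, valB = rf.tick(3, 17, 3, 4)  # add writes R3 while nand reads R3 and R4
print(valA)  # 17: nand sees the freshly written value
```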
Data Hazards: lw
lw 3 12(zero)
and 8 0 2
nand 5 3 4

Time:  1   2   3   4   5   6   7
lw     IF  ID  EX  MEM WB
and        IF  ID  EX  MEM WB
nand           IF  ID  EX  MEM WB
A tricky case

add 1 1 2
add 1 1 3
add 1 1 4

[Figure: per-instruction pipeline diagrams (IM, Reg, ALU, DM, Reg): each add needs the R1 value produced by the add immediately before it]
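Why is this tricky? Two in-flight instructions (in EX/Mem and in Mem/WB) both want to supply R1, and the forwarding control must pick the younger one. A sketch (the helper name is invented):

```python
# EX/Mem must take priority over Mem/WB when both carry the same dest reg.
def pick(src, exmem, memwb, regfile):
    for dest, val in (exmem, memwb):      # youngest producer first
        if dest == src:
            return val
    return regfile[src]

regfile = {1: 1, 2: 2, 3: 3, 4: 4}
v1 = regfile[1] + regfile[2]                               # add 1 1 2 -> 3
v2 = pick(1, (1, v1), (None, None), regfile) + regfile[3]  # add 1 1 3 -> 6
v3 = pick(1, (1, v2), (1, v1), regfile) + regfile[4]       # add 1 1 4 -> 10
wrong = pick(1, (1, v1), (1, v2), regfile) + regfile[4]    # stale order -> 7
print(v3, wrong)  # 10 7
```

Checking Mem/WB first would silently use the one-instruction-older value of R1, which is why the bypass priority matters.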
Another tricky case

add 0 4 1
add 3 0 4
Handling Data Hazards: Summary
• Forward
– New bypass datapaths route computed data to where it is needed
• Beware: stalling may still be required even in the presence of forwarding
• For lw
– MIPS 2000/3000: a delay slot
– MIPS 4000 onwards: stall (but really rely on the compiler to treat it like a delay slot)
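The stall decision for loads can be sketched as a single check (names are invented; this assumes the one-bubble load-use penalty of this five-stage pipeline): stall when the instruction in EX is a lw whose destination matches a source of the instruction in decode.

```python
def must_stall(ex_op, ex_dest, id_regA, id_regB):
    """One bubble needed: lw's data exists only after MEM, too late for
    the immediately following instruction's EX stage."""
    return ex_op == "lw" and ex_dest in (id_regA, id_regB)

print(must_stall("lw", 6, 6, 2))   # True: sw 6 12(2) right after lw 6 24(3)
print(must_stall("add", 3, 3, 4))  # False: an ALU result can be forwarded
```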
Detection
• Detection:
– Compare regA with previous DestRegs
– Compare regB with previous DestRegs
– regA and regB going to the ALU
– Previous DestRegs: the instructions currently in ALU, Memory, and Writeback
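Those comparisons amount to a few equality checks per cycle; a sketch (function and signal names invented):

```python
def hazards(regA, regB, dests_in_flight):
    """dests_in_flight: dest registers of the instructions currently in
    EX, MEM, and WB (None where the instruction writes no register)."""
    return [(src, stage)
            for src in (regA, regB)
            for stage, dest in zip(("EX", "MEM", "WB"), dests_in_flight)
            if dest is not None and dest == src]

# nand 5 3 4 in decode while add 3 1 2 is in EX: its regA (R3) matches.
print(hazards(3, 4, (3, None, None)))  # [(3, 'EX')]
```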
[Figure: the pipelined datapath again, with the regA and regB fields in decode compared against the dest fields held in the ID/EX, EX/Mem, and Mem/WB pipeline registers]
Sample Code
Which data hazards do you see?
add 3 1 2
nand 5 3 4
add 7 6 3
lw 6 24(3)
sw 6 12(2)
Sample Code
Answer: nand 5 3 4, add 7 6 3, and lw 6 24(3) all read the R3 written by add 3 1 2; sw 6 12(2) reads the R6 written by the lw immediately before it, a load-use hazard that forwarding alone cannot fix.
Hazard (cycle-by-cycle walkthrough of the sample code on the datapath):
[Figure: first half of cycle 3: nand 5 3 4 in decode; the regA comparison flags R3, whose value add is still computing]
[Figure: end of cycle 3]
[Figure: new hazard, first half of cycle 4: add 7 6 3 in decode also needs R3]
[Figure: end of cycle 4]
[Figure: first half of cycle 5]
[Figure: end of cycle 5]
[Figure: first half of cycle 6: hazard: sw 6 12(2) needs the R6 that lw is still loading, so the pipeline stalls]
[Figure: end of cycle 6: a nop is inserted]
[Figure: first half of cycle 7: the hazard persists while the bubble drains]
[Figure: end of cycle 7]
[Figure: first half of cycle 8]
[Figure: end of cycle 8]
Control Hazards
beqz 3 goToZero
nand 5 1 4
add 6 7 8

Time:  1   2   3   4   5   6
beqz   IF  ID  EX  MEM WB
nand       IF  ID  EX  MEM WB
Control Hazards for 4-stage pipeline
beqz 3 goToZero
nand 5 1 4
add 6 7 8

Time:  1    2    3    4    5
beqz   F/D  EX   MEM  WB
nand        F/D  EX   MEM  WB
Control Hazards
• Stall
– Inject NOPs into the pipeline when the next instruction is not known
– Pros: simple, clean; Cons: slow
• Delay Slots
– Tell the programmer that the N instructions after a jump will always be executed, no matter what the outcome of the branch
– Pros: the compiler may be able to fill the slots with useful instructions; Cons: breaks the abstraction boundary
• Speculative Execution
– Insert instructions into the pipeline
– Replace instructions with NOPs if the branch comes out opposite of what the processor expected
– Pros: clean model, fast; Cons: complex
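The performance trade-off among these options can be sketched with back-of-the-envelope CPI arithmetic (the branch fraction, penalty, and prediction accuracy below are illustrative assumptions, not numbers from the slides):

```python
def cpi(branch_frac, strategy, penalty, predict_accuracy=0.0):
    """Cycles per instruction, charging wasted fetch slots to each branch.
    penalty = instructions fetched before the branch outcome is known."""
    if strategy == "stall":
        wasted = penalty                            # always pay in full
    elif strategy == "speculate":
        wasted = (1 - predict_accuracy) * penalty   # pay only on mispredicts
    elif strategy == "delay_slot":
        wasted = 0                                  # best case: slots all filled
    return 1 + branch_frac * wasted

# 20% branches, branch resolved in stage 4 of the 5-stage pipeline
# (3 instructions fetched too early):
print(cpi(0.2, "stall", 3))                            # 1.6
print(cpi(0.2, "speculate", 3, predict_accuracy=0.9))  # ~1.06
```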
Control Hazards for 4-stage pipeline
beqz 3 goToZero
nand 5 1 4
add 6 7 8
goToZero: lui 2 1

Time:  1    2    3    4    5
beqz   F/D  EX   MEM  WB
nand        F/D  EX   MEM  WB
lui              F/D  EX   MEM  WB

Delay slot! nand is fetched before the branch resolves and executes regardless; the taken branch then fetches lui 2 1.
Hazard summary
• Data hazards
– Forward
– Stall for lw/sw
• Control hazards
– Delay slot
Pipelining for Other Circuits
• Pipelining can be applied to any combinational logic circuit
• What's the circuit latency at 2 ns per gate?
• What's the throughput?
• What is the stage breakdown? Where would you place latches?
• What's the throughput after pipelining?
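As a worked example of the arithmetic these questions ask for (the circuit is invented: assume a chain of 6 gate levels, and ignore latch overhead):

```python
GATE_DELAY_NS = 2          # from the question: 2 ns per gate
LEVELS = 6                 # assumed depth of the example circuit

latency = LEVELS * GATE_DELAY_NS      # unpipelined: 12 ns per result
unpipelined_throughput = 1 / latency  # one result every 12 ns

# Place latches every 2 levels -> 3 stages, 4 ns clock:
STAGE_LEVELS = 2
cycle_ns = STAGE_LEVELS * GATE_DELAY_NS
pipelined_throughput = 1 / cycle_ns   # one result every 4 ns in steady state

print(latency, cycle_ns)  # 12 4
```

Latency per result does not improve (it slightly worsens once latch delay is counted), but throughput triples, which is the whole point of pipelining.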