© G.N. Khan Computer Organization & Architecture-COE608: Pipelining Page: 1
Enhancing Performance by Pipelining
COE608: Computer Organization and Architecture
Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan
Electrical and Computer Engineering
Ryerson University
Overview
• Introduction to Pipelining
  ♦ Laundry Example
• Pipelining and CPU Performance
  ♦ Pipelining Hazards
  ♦ Hazard Solutions
• Pipelined Datapath and Control
  ♦ Stalling, Forwarding and Flushing
Chapter 4 (Sections 4.5, 4.6, 4.7, 4.8 and part of 4.10/4.11) of Text
Introduction
Laundry Example
Ann, Brian, Cathy and Dave (A, B, C, D) each have one load of clothes to wash, dry, fold and put away.
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers
Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
[Figure: sequential laundry timeline starting at 6 PM; loads A, B, C and D in task order, each taking four consecutive 30-minute steps before the next load starts.]
Pipelined Laundry
Start work ASAP
• Multiple tasks operating simultaneously using different resources.
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C and D in task order overlap, and the four loads finish after seven 30-minute slots.]
Pipelined Laundry
• Pipelining doesn’t reduce the latency of a single task; it improves the throughput of the entire workload.
• Potential speedup = number of (pipe) stages.
• Pipeline rate is limited by the slowest pipeline stage.
• Unbalanced lengths of pipe stages reduce the speedup.
• Time to “fill” the pipeline and time to “drain” it reduce the speedup.
• Stall due to dependencies??
[Figure: pipelined laundry timeline from 6 PM to 9:30 PM; loads A, B, C and D in task order overlap across seven 30-minute slots.]
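The laundry timings can be checked with a few lines of Python. This is only a sketch of the arithmetic behind the example: the function names are made up for illustration, and every stage is assumed to take exactly the same time.

```python
def sequential_time(loads: int, stages: int, stage_minutes: int) -> int:
    """Minutes needed when each load finishes all stages before the next starts."""
    return loads * stages * stage_minutes


def pipelined_time(loads: int, stages: int, stage_minutes: int) -> int:
    """Minutes needed when a new load starts every stage time.

    The first load needs `stages` slots to fill the pipeline; every further
    load finishes one slot after the previous one.
    """
    return (stages + loads - 1) * stage_minutes


loads, stages, stage_minutes = 4, 4, 30   # values from the laundry example
seq = sequential_time(loads, stages, stage_minutes)    # 480 minutes = 8 hours
pipe = pipelined_time(loads, stages, stage_minutes)    # 210 minutes = 3.5 hours
print(f"sequential: {seq} min, pipelined: {pipe} min, speedup: {seq / pipe:.2f}")
```

With only four loads the fill and drain time keeps the speedup well below the number of stages; it approaches 4 only as the number of loads grows.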
Pipelining in the CPU
Improve performance by increasing instruction throughput.
Five Natural Stages in an Instruction Execution
Load Word Instruction Stages
• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read data from the Data Memory
• Wr: Write data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Ifetch Reg/Dec Exec Mem Wr
CPU Pipelining
Ideal speedup is number of stages in the pipeline.
What makes pipelining easy?
• When all instructions are of the same length.
• Few instruction formats.
• Memory operands appear only in loads and stores.
What makes pipelining hard?
• Structural Hazards
• Control Hazards
• Data Hazards: An instruction depends on a previous instruction.
[Figure: program execution order (in instructions) versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Non-pipelined: a new instruction starts every 8 ns. Pipelined: each stage takes 2 ns and a new instruction starts every 2 ns.]
Single Cycle, Multiple Cycle vs. Pipeline
Suppose we execute 100 instructions
• Single Cycle Machine
  45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
  10 ns/cycle x 4.6 CPI (mix of inst) x 100 inst = 4600 ns
• Ideal Pipelined Machine
  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
[Figure: clock timing of the three implementations. Single Cycle Implementation: Load and Store each take one long cycle (Cycle 1, Cycle 2), with part of the Store cycle marked “Waste”. Multiple cycle: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem) and R-type run one after another over Cycles 1 to 10. Pipelined: Load, Store and R-type (Ifetch, Reg, Exec, Wr) overlap.]
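The three execution times above can be reproduced with a small Python sketch. The function names and default arguments are illustrative only; the cycle times and the 4.6 average CPI are the figures quoted on this slide, and the pipeline is assumed to be ideal (no stalls).

```python
def single_cycle_ns(n_inst: int, cycle_ns: float = 45) -> float:
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * n_inst


def multicycle_ns(n_inst: int, cycle_ns: float = 10, avg_cpi: float = 4.6) -> float:
    """Short cycles, several per instruction; avg_cpi depends on the instruction mix."""
    return cycle_ns * avg_cpi * n_inst


def ideal_pipeline_ns(n_inst: int, cycle_ns: float = 10, stages: int = 5) -> float:
    """One instruction completes per cycle, plus (stages - 1) cycles to drain."""
    return cycle_ns * (n_inst + stages - 1)


n = 100
print(f"single cycle: {single_cycle_ns(n):.0f} ns")
print(f"multicycle:   {multicycle_ns(n):.0f} ns")
print(f"pipelined:    {ideal_pipeline_ns(n):.0f} ns")
```

The same formula covers the three-lw example from the earlier figure: with 2 ns stages, `ideal_pipeline_ns(3, cycle_ns=2)` gives 14 ns, against 3 x 8 ns = 24 ns without pipelining.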
Pipeline Hazards
• Structural Hazards: attempt to use the same resource in two different ways at the same time
• Data Hazards: attempt to use an item before it is ready
  An instruction depends on the result of a prior instruction still in the pipeline
• Control Hazards: attempt to make a decision before the condition is evaluated
  Branch instructions
The hazards can always be resolved by waiting.
Pipeline control must detect the hazards and take action (or delay action) to resolve them.
Hazards
Structural Hazard: Single Memory
[Figure: pipeline diagram, time (clock cycles) versus instruction order, for Load, Instr 1, Instr 2, Instr 3 and Instr 4; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Control Hazard
[Figure: pipeline diagram, time (clock cycles) versus instruction order, for Add, Beq and Load; each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Control Hazard Solutions
• Stall: it is possible to move the decision up to the 2nd stage by adding hardware that checks the registers as they are being read.
• Predict: guess one branch direction, then back up if wrong.
[Figure: pipeline diagram, time (clock cycles) versus instruction order, for Add, Beq and Load.]
Control Hazard Solutions
• “Delayed branch”: redefine branch behavior so that the branch takes place after the next instruction.
• Impact: No clock cycle is wasted if one can find an instruction to put in the “slot”.
[Figure: pipeline diagram, time (clock cycles) versus instruction order, for Add, Beq, Misc and Load.]
Data Hazard
Data Hazard on r1
Dependencies backwards in time are hazards.
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB), time (clock cycles) versus instruction order, for the sequence above; each instruction passes through Im, Reg, ALU, Dm, Reg.]
“Forward” result from one stage to another.
[Figure: the same pipeline diagram for the same five instructions.]
Pipelining R-type and Load
Structural hazard:
• Two instructions try to write to the register file at the same time.
Load uses Register File’s Write Port in 5th stage.
R-type uses Register File’s Write Port in 4th stage.
Solution ???
[Figure: clock Cycles 1 to 9 with the sequence R-type, R-type, Load, R-type, R-type. R-type: Ifetch, Reg/Dec, Exec, Wr (stages 1 to 4). Load: Ifetch, Reg/Dec, Exec, Mem, Wr (stages 1 to 5). Annotation: “We have a problem!”]
Pipelining R-type and Load
Insert “Bubble” into the Pipeline
Insert a “bubble” into the pipeline to prevent two writes in the same cycle.
• The control logic can be complex.
• Lose instruction fetch and issue opportunity.
[Figure: clock Cycles 1 to 9 with the sequence R-type, R-type, Load, R-type, R-type and a pipeline bubble inserted so that no two instructions reach Wr in the same cycle.]
Pipelining R-type and Load
Delay R-type’s register-write by one cycle:
• Now R-type instructions also use Reg File’s write port at Stage 5.
• Mem stage is a NOOP stage: nothing is being done.
[Figure: clock Cycles 1 to 9 with the sequence R-type, R-type, Load, R-type, R-type; every instruction now has five stages (Ifetch, Reg/Dec, Exec, Mem, Wr) and writes the register file in stage 5.]
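The write-port conflict and its fix can be checked mechanically. The sketch below is a hypothetical helper, not part of the course material; it assumes one instruction is issued per cycle starting at cycle 1 and that each instruction kind writes the register file in a fixed stage.

```python
from collections import defaultdict


def write_port_conflicts(program, write_stage):
    """Return {cycle: [instruction indices]} for cycles with more than one register write.

    program:     list of instruction kinds, issued one per cycle from cycle 1.
    write_stage: maps each kind to the stage number in which it writes the register file.
    """
    writes = defaultdict(list)
    for index, kind in enumerate(program):
        issue_cycle = index + 1
        write_cycle = issue_cycle + write_stage[kind] - 1
        writes[write_cycle].append(index)
    return {cycle: idxs for cycle, idxs in writes.items() if len(idxs) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]

# R-type writes in stage 4, Load in stage 5: the Load and the next R-type collide.
print(write_port_conflicts(program, {"R-type": 4, "Load": 5}))

# R-type write delayed to stage 5: every instruction writes in its own cycle.
print(write_port_conflicts(program, {"R-type": 5, "Load": 5}))
```

For the sequence on the slide, the first call reports a conflict in cycle 7 between the Load and the R-type that follows it; the second call returns an empty dictionary.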
Stages in the Datapath
What do we need to add for splitting a datapath into stages?
[Figure: datapath (PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, data memory, multiplexers) divided into stages by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]
Pipelined Datapath and Control
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, the ALU control unit, and the control signals RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite and PCSrc.]
Pipeline Control
Main Control generates the control signals during Instruction Decode (Reg/Dec)
• Exec (ExtOp, ALUSrc, ...) control signals are used 1 cycle later.
• Mem (MemWr, Branch) control signals are used 2 cycles later.
• Wr (MemtoReg, RegWr) control signals are used 3 cycles later.
Pass control signals along just like the data
Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
Memory access stage control lines: Branch, MemRead, MemWrite
Write-back stage control lines: RegWrite, MemtoReg

Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X

[Figure: Main Control in the Reg/Dec stage produces ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg and RegWr. The signals travel through the ID/Ex, Ex/Mem and Mem/Wr registers; the Exec signals are dropped after Exec, MemWr and Branch after Mem, and MemtoReg and RegWr reach the Wr stage.]
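The idea of passing control signals along like data can be sketched in Python. The control values are copied from the table above; the helper names `decode` and `pass_along` and the dictionary layout are assumptions made for illustration only.

```python
# Control values copied from the table above; "X" marks a don't-care.
SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",
           "Branch", "MemRead", "MemWrite",
           "RegWrite", "MemtoReg")
CONTROL = {
    "R-format": (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":       (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":       ("X", 0, 0, 1, 0, 0, 1, 0, "X"),
    "beq":      ("X", 0, 1, 0, 1, 0, 0, 0, "X"),
}
# Which stage consumes which signals (grouping taken from the table).
GROUPS = {"EX": SIGNALS[0:4], "M": SIGNALS[4:7], "WB": SIGNALS[7:9]}


def decode(instr):
    """Control word produced during decode, grouped by the stage that uses it."""
    word = dict(zip(SIGNALS, CONTROL[instr]))
    return {stage: {s: word[s] for s in names} for stage, names in GROUPS.items()}


def pass_along(fields, consumed):
    """Contents of the next pipeline register: everything except the group just used."""
    return {stage: sigs for stage, sigs in fields.items() if stage != consumed}


id_ex = decode("lw")                 # EX, M and WB groups
ex_mem = pass_along(id_ex, "EX")     # M and WB groups
mem_wb = pass_along(ex_mem, "M")     # WB group only
print(mem_wb)
```

Each pipeline register only has to carry the groups that later stages still need, which is why the control fields shrink from one register to the next.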
Datapath with Control
[Figure: pipelined datapath with the Control unit; the control signals are grouped into EX, M and WB fields that are carried through the ID/EX, EX/MEM and MEM/WB pipeline registers.]
Dependencies
Problem: starting the next instruction before the first has completed.
Dependencies that “go backward in time” cause data hazards.
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: pipeline diagram (IM, Reg, DM, Reg), program execution order versus time in clock cycles CC 1 to CC 9, for the sequence above. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9.]
Software Solution
Have the compiler guarantee no hazards.
Where do we insert the “noops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Problem: this really slows down the application!
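One way to answer the question is to let a small script place the noops. The Python sketch below is illustrative only. It assumes a five-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so a consumer must be issued at least three slots after its producer; the tuple format `(text, dest, sources)` and the name `insert_nops` are hypothetical.

```python
MIN_DISTANCE = 3  # assumed: slots between a producer and the first safe consumer


def insert_nops(program):
    """program: list of (text, dest, sources). Returns instruction texts with nops added."""
    scheduled = []     # instruction texts in issue order, including nops
    last_write = {}    # register -> slot of the latest instruction that writes it
    for text, dest, sources in program:
        # Earliest slot at which every source register can be read safely.
        earliest = max(
            (last_write[r] + MIN_DISTANCE for r in sources if r in last_write),
            default=0,
        )
        while len(scheduled) < earliest:
            scheduled.append("nop")
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append(text)
    return scheduled


program = [
    ("sub $2, $1, $3",  "$2",  ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2",  "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None,  ("$15", "$2")),
]
for line in insert_nops(program):
    print(line)
```

Under these assumptions two noops go between the sub and the and; the or, add and sw are then far enough from the sub to read the new value of $2.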
Use temporary results and don’t wait for them to be written into the register file.
• Forwarding to handle read/write to the same register
• ALU forwarding
[Figure: pipeline diagram (IM, Reg, DM, Reg), program execution order versus time in clock cycles CC 1 to CC 9, for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9. Value of EX/MEM: -20 in CC 4, X otherwise. Value of MEM/WB: -20 in CC 5, X otherwise.]
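The choice made for each ALU operand can be sketched as a small Python function. It follows the usual textbook conditions, which this slide does not spell out, so the RegWrite check, the exclusion of register 0 and the dictionary layout should be read as assumptions; the pipeline-register field names are the ones used in these slides.

```python
def forward_select(operand_reg, ex_mem, mem_wb):
    """Choose where one ALU operand comes from for the instruction now in EX.

    operand_reg:    register number (Rs or Rt) read by the instruction in EX.
    ex_mem, mem_wb: dicts with "RegWrite" (bool) and "RegisterRd" (int) taken
                    from the EX/MEM and MEM/WB pipeline registers.
    Returns "EX/MEM", "MEM/WB" or "REG" (the value read from the register file).
    """
    # Most recent result first: the instruction one ahead is now in MEM.
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 \
            and ex_mem["RegisterRd"] == operand_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in WB.
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 \
            and mem_wb["RegisterRd"] == operand_reg:
        return "MEM/WB"
    return "REG"


# CC 4: "and $12, $2, $5" is in EX while "sub $2, $1, $3" is in MEM.
print(forward_select(2, {"RegWrite": True, "RegisterRd": 2},
                        {"RegWrite": False, "RegisterRd": 0}))

# CC 5: "or $13, $6, $2" is in EX, the and is in MEM and the sub is in WB.
print(forward_select(2, {"RegWrite": True, "RegisterRd": 12},
                        {"RegWrite": True, "RegisterRd": 2}))
```

The two calls correspond to the cycles in which the value -20 sits in EX/MEM and in MEM/WB respectively.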
Forwarding
[Figure: pipelined datapath with a forwarding unit; the signals shown at the unit include the Rs and Rt fields, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
Can't always forward
An instruction tries to read a register following a load instruction that writes to the same register.
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram (IM, Reg, DM, Reg), program execution order versus time in clock cycles CC 1 to CC 9, for the sequence above.]
Stalling
A hazard detection unit is needed to “stall” the pipeline after a load instruction.
Stall the pipeline by keeping an instruction in the same stage.
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. Signals shown at the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite and IF/IDWrite; a multiplexer with a 0 input follows the Control unit.]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram (IM, Reg, DM, Reg), program execution order versus time in clock cycles CC 1 to CC 10, for the sequence above, with a bubble inserted.]
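The load-use check can be written as a short Python sketch. The condition is the standard textbook one and is an assumption here, since the slide shows only the signal names; the returned key `insert_bubble` is a made-up label for zeroing the control signals that enter ID/EX.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in the decode stage must wait for a load in EX.

    id_ex_mem_read:     True if the instruction in EX is a load (ID/EX.MemRead).
    id_ex_rt:           destination register of that load (ID/EX.RegisterRt).
    if_id_rs, if_id_rt: source registers of the instruction being decoded.
    """
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,       # hold the PC so the next fetch is repeated
        "IF/IDWrite": not stall,    # hold IF/ID so the same instruction is decoded again
        "insert_bubble": stall,     # zero the control signals going into ID/EX
    }


# "lw $2, 20($1)" is in EX while "and $4, $2, $5" is being decoded: stall one cycle.
print(hazard_detection(True, 2, 2, 5))

# One cycle later ID/EX holds the bubble, so the and proceeds.
print(hazard_detection(False, 0, 2, 5))
```

After the one-cycle stall, the loaded value can be forwarded from MEM/WB to the waiting instruction.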
Branch Hazards
When we decide to branch, other instructions are already in the pipeline.
We predict “branch not taken”.
Need to add hardware for flushing instructions if we are wrong.
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: pipeline diagram (IM, Reg, DM, Reg), program execution order versus time in clock cycles CC 1 to CC 9, for the instructions above, listed with their addresses.]
[Figure: pipelined datapath with a hazard detection unit, a forwarding unit, an equality comparator (=) and an IF.Flush control signal.]
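The addresses in the example follow from the MIPS branch encoding, where the offset counts words relative to PC + 4. The Python sketch below shows that arithmetic; the helper `wrong_path_fetches` and its `resolve_stage` parameter are assumptions added to illustrate how many instructions must be flushed when “not taken” was predicted and the branch is taken.

```python
def branch_target(pc, offset_words):
    """Target of a MIPS beq: address of the next instruction plus offset * 4."""
    return pc + 4 + (offset_words << 2)


def wrong_path_fetches(pc, resolve_stage):
    """Addresses fetched behind the branch before its outcome is known.

    resolve_stage is the 1-based pipeline stage in which the branch is decided;
    with "branch not taken" predicted, these instructions are flushed when the
    branch turns out to be taken.
    """
    return [pc + 4 * k for k in range(1, resolve_stage)]


print(branch_target(40, 7))        # beq at address 40 with offset 7
print(wrong_path_fetches(40, 4))   # branch decided in the 4th stage
print(wrong_path_fetches(40, 2))   # branch decided in the 2nd stage
```

For the beq at address 40 the target is 72, the address of the lw. If the branch is decided in the 4th stage, the instructions at 44, 48 and 52 are already in the pipeline; moving the decision to the 2nd stage leaves only the one at 44 to flush.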
Improving Performance
Try to avoid stalls, e.g. reorder instructions.
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Add a “branch delay slot”
• Always execute the next instruction after a branch.
• Rely on the compiler to “fill” the slot with something useful rather than NOOPs.
Dynamic Scheduling
The hardware performs the “scheduling”
• Hardware tries to find instructions to execute
• Out of order execution is possible
• Speculative execution and dynamic branch prediction
Modern processors are very complicated
• Alpha 21264: 9 stage pipeline, 6 instruction issue
• PowerPC and Pentium: keep a branch history table.
Compiler technology important
Superscalar Architecture
Start more than one instruction in the same cycle
Nondelayed vs. Delayed Branch
[Figure: the code sequence below shown twice, once for the nondelayed branch and once for the delayed branch, each followed by the label Exit: and with a “Fill Slot” annotation.]
add $1, $2, $3
sub $4, $5, $6
beq $1, $4, Exit
or $8, $9, $10
xor $10, $1, $11
Superscalar Architecture
An Example: Independent integer and FP issue to separate pipelines
Single Issue Total Time = Int Exec Time + FP Exec Time
Max Speedup = Total Time / MAX(Int Exec Time, FP Exec Time)
[Figure: block diagram with I-Cache, Inst Issue and Bypass, Int Reg, FP Reg, Operand / Result Busses, Int. Unit, Load / Store Unit, FP Add, FP Mul and D-Cache.]
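The speedup bound can be tried out with a couple of numbers. The function name and the sample times below are made up for illustration; the formula is the one given on this slide and assumes the integer and FP instruction streams are fully independent.

```python
def max_dual_issue_speedup(int_time, fp_time):
    """Upper bound on speedup when integer and FP work run in separate pipelines."""
    single_issue_total = int_time + fp_time
    return single_issue_total / max(int_time, fp_time)


print(max_dual_issue_speedup(50, 50))   # perfectly balanced mix
print(max_dual_issue_speedup(90, 10))   # mostly integer work
```

A balanced mix reaches the full factor of 2, while a 90/10 mix gains only about 11 percent, which is the mismatch effect that the laundry analogy on the next slides illustrates.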
Superscalar Laundry: Parallel per stage
More resources, HW to match mix of parallel tasks?
[Figure: superscalar laundry timeline starting at 6 PM; loads A to F in task order, labelled light clothing, dark clothing and very dirty clothing, processed in parallel in 30-minute steps.]
Superscalar Laundry: Mismatch Mix
Task mix underutilizes extra resources
[Figure: laundry timeline starting at 6 PM; loads A to D in task order, three labelled light clothing and one labelled dark clothing, in 30-minute steps.]
General Superscalar Organization
Super-Pipelined
• Many pipeline stages need less than half a clock cycle.
• Double internal clock speed gets two tasks per external clock cycle.
• Superscalar allows parallel fetch and execute.
Superscalar Execution
• Simultaneously fetch multiple instructions.
• Logic to determine true dependencies involving register values.
• Mechanisms to communicate these values.
• Mechanisms to initiate multiple instructions in parallel.
• Resources for parallel execution of multiple instructions.
• Mechanisms for committing process state in correct order.
The PowerPC 601 Architecture
Intel Itanium Microprocessor