CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 1 CS104 Computer Organization and Design Datapaths

CS104 Computer Organization and Design
Datapaths
Admin
• Homework • Homework 4 out tonight • Due Monday March 26th • Download/check your submissions
• Reading: • Chapter 4 • (Maybe review 1.4)
• Midterm 2 • March 28
What did we do last week?
• Who can remind us what we did last week? • Ski • Go to the beach • Sleep in • Read a book • …
• Ok, but seriously?
When last I saw you all..
• Last time I was here (Feb 27/29) • Learned basics of logic design
• Gates (And, Or, Nor, …) • Put gates together to make
• Muxes • Adders • Latches • Flip-flops • …
While I was at HPCA..
• Prof. Lebeck started teaching you all about datapaths • Putting logic together to execute instructions • Started on single-cycle datapath
• We’ll review/continue with single cycle • Then jump into more things!
Datapath for MIPS ISA
• Consider only the following instructions
    add $1,$2,$3
    addi $1,$3,2
    lw $1,4($3)
    sw $1,4($3)
    beq $1,$2,PC_relative_target
    j absolute_target
• Why only these? • Most other instructions are the same from datapath viewpoint • The ones that aren’t are left for you to figure out
Start With Fetch
• PC and instruction memory • A +4 incrementer computes default next instruction PC
First Instruction: add
Second Instruction: addi
• Destination register can now be either Rd or Rt • Add sign extension unit and mux into second ALU input
Third Instruction: lw
• Add data memory, address is ALU output • Add register write data mux to select memory output or ALU output
Fourth Instruction: sw
• Add path from second input register to data memory data input
Fifth Instruction: beq
• Add left shift unit and adder to compute PC-relative branch target • Add PC input mux to select PC+4 or branch target
Sixth Instruction: j
• Add shifter to compute left shift of 26-bit immediate • Add additional PC input mux for jump target
“Continuous Read” Datapath Timing
• Works because writes (PC, RegFile, DMem) are independent • And because no read logically follows any write
Timeline within one cycle (figure labels): Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC
What Is Control?
• 9 signals control flow of data through this datapath • MUX selectors, or register/memory write enable signals • A real datapath has 300-500 control signals
Example: Control for add
Example: Control for sw
• Difference between sw and add is 5 signals • 3 if you don’t count the X (don’t care) signals
Example: Control for beq
You all figure out LW
Example: Control for LW
How Is Control Implemented?
Implementing Control
• Each insn has a unique set of control signals • Most are function of opcode • Some may be encoded in the instruction itself
• E.g., the ALUop signal is some portion of the MIPS Func field + Simplifies controller implementation • Requires careful ISA design
Control Implementation: ROM
• ROM (read only memory): think rows of bits • Bits in data words are control signals • Lines indexed by opcode • Example: ROM control for 6-insn MIPS datapath • X is “don’t care”
opcode  BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
add     0   0   0       0      0     1    0     0
addi    0   0   1       0      0     1    1     0
lw      0   0   1       0      0     1    1     1
sw      0   0   1       0      1     0    X     X
beq     1   0   0       1      0     0    X     X
j       0   1   0       0      0     0    X     X
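The ROM table above can be sketched as a lookup table in Python. This is only an illustration: indexing by mnemonic instead of a numeric opcode, using None for X, and the helper names control_signals and differing_signals are all assumptions made here, not part of the slides.

```python
# Control ROM for the six-instruction MIPS subset, copied from the table.
# None stands for X ("don't care"). Keys are mnemonics for readability;
# a real ROM would be indexed by the numeric opcode (assumption).
SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")

CONTROL_ROM = {
    "add":  (0, 0, 0, 0, 0, 1, 0, 0),
    "addi": (0, 0, 1, 0, 0, 1, 1, 0),
    "lw":   (0, 0, 1, 0, 0, 1, 1, 1),
    "sw":   (0, 0, 1, 0, 1, 0, None, None),
    "beq":  (1, 0, 0, 1, 0, 0, None, None),
    "j":    (0, 1, 0, 0, 0, 0, None, None),
}


def control_signals(opcode):
    """Return the control word for one instruction as a name -> value dict."""
    return dict(zip(SIGNALS, CONTROL_ROM[opcode]))


def differing_signals(op_a, op_b, count_dont_cares=True):
    """List the signals whose values differ between two instructions."""
    diff = []
    for name, a, b in zip(SIGNALS, CONTROL_ROM[op_a], CONTROL_ROM[op_b]):
        if a is None or b is None:
            if count_dont_cares:
                diff.append(name)
        elif a != b:
            diff.append(name)
    return diff
```

Comparing add with sw through differing_signals reproduces the counts quoted on the sw control slide: five signals differ if the X entries are counted, three if they are not.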
Control Implementation: Random Logic
• Real machines have 100+ insns 300+ control signals • 30,000+ control bits (~4KB) – Not huge, but hard to make faster than datapath (important!)
• Alternative: random logic (random = ‘non-repeating’) • Exploits the observation: many signals have few 1s or few 0s • Example: random logic control for 6-insn MIPS datapath
[Figure: random logic control generating BR, JP, ALUinB, ALUop, DMwe, Rwe, Rdst, Rwd]
Datapath and Control Timing
Read DMEM Write DMEM Write Registers
Write PC
Single-Cycle Datapath Performance
• Goes against make common case fast (MCCF) principle + Low Cycles Per Instruction (CPI): 1 – Long clock period: to accommodate slowest insn
Interlude: Performance
• Previous slide alludes to something new: Performance • Don’t just want it to work… • But want it to go fast!
• Three components to performance: Number of instructions x Cycles per instruction (CPI) x Clock Period (1 / Clock frequency)
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
• Insns/Program: determined by compiler + ISA • Generally assume fixed program when do micro-architecture
Micro-architectural factors
• Micro-architecture: • The details of how the ISA is implemented • Affects CPI and Clock frequency
• Often will look at fixed program, and consider MIPS • Million Instructions Per Second • MIPS = IPC * Frequency (in MHz) • IPC = Instruction Per Cycle (1 / CPI) • Gives “Bigger is better” number
(Instructions/Cycle) x (Cycles/Second) = Instructions/Second, i.e., IPC x Frequency = Throughput
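The two performance formulas can be written as a small Python sketch. The function names are assumptions made for illustration; only the formulas themselves come from the slides.

```python
def execution_time_seconds(insn_count, cpi, clock_period_s):
    """Seconds/Program = Insns/Program x Cycles/Insn x Seconds/Cycle."""
    return insn_count * cpi * clock_period_s


def mips(ipc, frequency_mhz):
    """Million Instructions Per Second = IPC x frequency in MHz."""
    return ipc * frequency_mhz


# Illustrative numbers: one million insns, CPI = 1, 50 ns clock (20 MHz).
print(execution_time_seconds(1_000_000, 1.0, 50e-9))
print(mips(1.0, 20.0))
```

Lower is better for execution time, while MIPS gives the “bigger is better” number for a fixed program.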
“Best” IPC
• For now, best we can do: IPC = 1 (CPI = 1) • Do 1 instruction every cycle
• Later: • Real processors can do multiple instructions at once! • Potentially: IPC > 1! • Best possible IPC depends on design
Performance vs ….
• 1990s: Performance at all cost • Actually more “clock frequency” at all cost…
• Now: Care about other things • Energy (electric bill, battery life) • Power (cooling, also affects energy) • Area (chip cost) • Reliability (tolerance of transient faults: e.g., charged-particle strikes) • …
• Important metric these days “Performance / Watt” • Throughput divided by power consumption • Why?
Performance Modeling and Analysis
• Speaking of performance • Making a processor takes time (years) and money (millions) • Want to know it will perform well before you finish
• If it’s wrong, doing it all over is painful… • Performance can be simulated in software
• Estimate what IPC will be • Guide design
• This is my other job by the way…
Alternative: Multi-Cycle Datapath
• Multi-cycle datapath: attacks high clock period • Cut datapath into multiple stages (5 here), isolate using FFs • FSM control “walks” insns thru stages (by staging control signals) + Insns can bypass stages and exit early
Finite State Machine (FSM)
• FSM = States + Transitions • Next state: function of current state + inputs • Outputs: function of current state + inputs
• Canonical Example: Combination Lock • Must enter 3 8 4 to unlock
• P.S. Useful in software too
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• Initial State (Start): no valid piece of combo seen
• Input of 3: transition to new state (State 1) • Any other input: stay in same state
• State 1: • Input = 8? Goto state 2
• State 2: • Input = 4? Goto state 3
• State 3: Unlock!
FSM in Hardware
• Flip-flop(s) to hold the state • Combinational logic to determine next state/output • (Assumes FF enable on input_valid)
FSM Hardware Example
FSM Implementation: ROM
• Just saw: FSM implemented with sum-of-products • Remind us what that is?
• Can also be implemented with a ROM
2^(N+K)-entry ROM (N + K = state bits + input bits)
FSM ROM Implementation Example
• Combination Lock (3 8 4) Example • 4-bit input • 2-bit state • 64-entry ROM (indexed with S1 S0 I3 I2 I1 I0)
• Each entry needs 3 bits (S1 S0 U) • 2 for next state • 1 for unlock signal
• Example entries in ROM • 0x00 = 000 • 0x03 = 010 • 0x18 = 100 • 0x13 = 010 • 0x3_ = 001
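A minimal Python sketch of this 64-entry ROM follows. The five example entries are taken from the slide; everything else is an assumption made here for illustration: a wrong digit in state 1 or 2 returns to Start (except a 3, which goes to state 1), and the helper names build_lock_rom, final_state, and is_unlocked are hypothetical.

```python
def build_lock_rom():
    """Build the ROM: index = S1 S0 I3 I2 I1 I0, entry = S1' S0' U."""
    rom = [0] * 64
    for state in range(4):
        for digit in range(16):
            if state == 3:
                next_state, unlock = 0, 1      # 0x3_ = 001
            elif (state, digit) == (1, 8):
                next_state, unlock = 2, 0      # 0x18 = 100
            elif (state, digit) == (2, 4):
                next_state, unlock = 3, 0
            elif digit == 3:
                next_state, unlock = 1, 0      # 0x03 = 010, 0x13 = 010
            else:
                next_state, unlock = 0, 0      # assumed: start over
            rom[(state << 4) | digit] = (next_state << 1) | unlock
    return rom


def final_state(rom, digits):
    """Step the FSM through a digit sequence, one ROM lookup per digit."""
    state = 0
    for digit in digits:
        state = rom[(state << 4) | digit] >> 1
    return state


def is_unlocked(rom, digits):
    """The unlock bit is asserted by every ROM entry of state 3."""
    return bool(rom[final_state(rom, digits) << 4] & 1)
```

The state flip-flops feed the upper two index bits and the keypad feeds the lower four, so one ROM read per input replaces the sum-of-products logic.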
Multi-cycle Datapath FSM
• First state: Get a New Instruction (Next Insn) • Output signals to fetch (e.g., read enable IMEM) • Next State: Always Decode
• Second State: Decode (Decode Insn) • Output signals to decode instruction (RdEn RegFile) • Go to Next Insn if NOP • Otherwise Execute
• Execute State (Execute Insn) • Execute Insn (varies by insn type) • Next State: Also depends on insn type
• Branches: Next Insn
• ALU op: write register
• Load: Read Memory
• Store: Write Memory
• Read DMEM State • Control signals enable DMEM Read • Next state is writeback
• Writeback state • Control signals enable regfile write • Next state: Next Insn
• Write DMEM state • Control signals enable memory write • Next state: Next Insn
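The state transitions described above can be sketched in Python. The state names, the instruction class strings, and the helper cycles_for are assumptions chosen here for illustration; only the transitions themselves come from the slides.

```python
def next_state(state, insn_class):
    """Next-state function of the multi-cycle control FSM."""
    if state == "fetch":                      # "Next Insn"
        return "decode"
    if state == "decode":
        return "fetch" if insn_class == "nop" else "execute"
    if state == "execute":
        return {
            "branch": "fetch",
            "alu": "writeback",
            "load": "read_dmem",
            "store": "write_dmem",
        }[insn_class]
    if state == "read_dmem":
        return "writeback"
    if state in ("writeback", "write_dmem"):
        return "fetch"
    raise ValueError(f"unknown state: {state}")


def cycles_for(insn_class):
    """Count the states one insn visits before the FSM returns to fetch."""
    state, cycles = "fetch", 0
    while True:
        cycles += 1
        state = next_state(state, insn_class)
        if state == "fetch":
            return cycles


for insn_class in ("branch", "alu", "store", "load"):
    print(insn_class, cycles_for(insn_class))
```

Each state takes one clock cycle, so the path length through the FSM is the per-instruction cycle count.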
Multi-Cycle Datapath Example: Add
• Example: Add • Cycle 1: Read IMEM • Cycle 2: Decode + Read RF • Cycle 3: ALU • Cycle 4: Writeback + Increment PC
Multi-Cycle Datapath Performance
• Opposite performance split of single-cycle datapath + Short clock period – High CPI
Multi-cycle Data-path CPI
• CPI depends on instructions • Branches / Jumps: 3 cycles • ALU: 4 cycles • Stores: 4 cycles • Loads: 5 cycles
• Overall CPI is weighted average
• Example: • 20% loads, 15% stores, 20% branches, 45% ALU
CPI = 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
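The weighted average can be checked with a short Python sketch. The function name weighted_cpi and the dictionary layout are assumptions made here; the mix and cycle counts are the ones from the example.

```python
def weighted_cpi(mix, cycles):
    """Average CPI: sum over insn classes of (fraction x cycles)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(fraction * cycles[name] for name, fraction in mix.items())


CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}
MIX = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}

print(round(weighted_cpi(MIX, CYCLES), 2))
```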
Multi-cycle Datapath Performance
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50 ns/insn
• Multi-cycle • Clock period = 10ns • CPI = (0.2*3+0.2*5+0.6*4) = 4 • Performance = 40 ns/insn
• But wait…
Multi-Cycle Datapath Performance
• Did not just cut up existing logic into 5 pieces • Also added logic (flip flops)
• So clock period not 1/5 of single cycle, but slightly longer
Multi-cycle Datapath Performance
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50 ns/insn
• Multi-cycle • Clock period = 12ns • CPI = (0.2*3+0.2*5+0.6*4) = 4 • Performance = 48 ns/insn
• Better, but not as exciting… • Can we do better still? • Have our cake (low CPI) and eat it too (high clock frequency)?
Clock Period and CPI
• Single-cycle datapath + Low CPI: 1 – Long clock period: to accommodate slowest insn
• Multi-cycle datapath + Short clock period – High CPI
• Can we have both low CPI and short clock period? – No good way to make a single insn go faster + Insn latency doesn’t matter anyway … insn throughput matters • Key: exploit inter-insn parallelism
[Figure: timelines of insn0 and insn1 (fetch, dec, exec) for the single-cycle and multi-cycle datapaths]
Pipelining
• Pipelining: important performance technique • Improves insn throughput rather than insn latency • Exploits parallelism at insn-stage level to do so • Begin with multi-cycle design
• When insn advances from stage 1 to 2, next insn enters stage 1
• Individual insns take same number of stages + But insns enter and leave at a much faster rate • Physically breaks “atomic” VN (von Neumann) loop ... but must maintain illusion
• Automotive assembly line analogy
5 Stage Multi-Cycle Datapath
5 Stage Pipelined Datapath
• Temporary values (PC,IR,A,B,O,D) re-latched every stage • Why? 5 insns may be in pipeline at once, they share a single PC? • Notice, PC not latched after ALU stage (why not?)
Pipeline Terminology
Some More Terminology
• Scalar pipeline: one insn per stage per cycle • Alternative: “superscalar” (next unit)
• In-order pipeline: insns enter execute stage in VN order • Alternative: “out-of-order” (not covered in CSE 371)
• Pipeline depth: number of pipeline stages • Nothing magical about five • Trend has been to deeper pipelines
Pipeline Example: Cycles 1 through 7
[Figures: one snapshot of the pipelined datapath per cycle, showing the PC, IR, A, B, O, D latches]
Pipeline Diagram
• Pipeline diagram: shorthand for what we just saw • Across: cycles • Down: insns • Convention: X means lw $4,0($5) finishes execute stage and writes into X/M latch at end of cycle 4
               1  2  3  4  5  6  7  8  9
add $3,$2,$1   F  D  X  M  W
lw $4,0($5)       F  D  X  M  W
sw $6,4($7)          F  D  X  M  W
What About Pipelined Control?
• Should it be like single-cycle control? • But individual insn signals must be staged
• Should it be like multi-cycle control? • But all stages are simultaneously active
• How many different controllers are we going to need? • One for each insn in pipeline?
• Solution: use simple single-cycle control, but pipeline it • Single controller
Pipelined Control
Pipeline Performance Calculation
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50ns/insn
• Multi-cycle • Branch: 20% (3 cycles), load: 20% (5 cycles), other: 60% (4 cycles) • Clock period = 12ns, CPI = (0.2*3+0.2*5+0.6*4) = 4
• Remember: latching overhead makes it 12, not 10 • Performance = 48ns/insn
• Pipelined • Clock period = 12ns • CPI = 1.5 (on average insn completes every 1.5 cycles) • Performance = 18ns/insn
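The three designs can be compared with a short Python sketch. The clock periods and CPIs are the ones from this slide; the dictionary DESIGNS and the helper names are assumptions made for illustration.

```python
# (clock period in ns, CPI) for each design, from the slide.
DESIGNS = {
    "single-cycle": (50.0, 1.0),
    "multi-cycle": (12.0, 4.0),
    "pipelined": (12.0, 1.5),
}


def ns_per_insn(design):
    """Average time per instruction = clock period x CPI."""
    period_ns, cpi = DESIGNS[design]
    return period_ns * cpi


def speedup(new, old):
    """How many times faster `new` is than `old` on the same program."""
    return ns_per_insn(old) / ns_per_insn(new)


for name in DESIGNS:
    print(name, ns_per_insn(name), "ns/insn")
```

With these numbers the pipelined design comes out a little under three times faster than the single-cycle one.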
Q1: Why Is Pipeline Clock Period …
• … > delay thru datapath / number of pipeline stages?
• Latches (FFs) add delay • Pipeline stages have different delays, clock period is max delay
• Both factors have implications for ideal number of pipeline stages
Q2: Why Is Pipeline CPI…
• … > 1? • CPI for scalar in-order pipeline is 1 + stall penalties • Stalls used to resolve hazards
• Hazard: condition that jeopardizes VN illusion • Stall: artificial pipeline delay introduced to restore VN illusion
• Calculating pipeline CPI • Frequency of stall * stall cycles • Penalties add (stalls generally don’t overlap in in-order pipelines) • 1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …
• Correctness/performance/MCCF • Long penalties OK if they happen rarely, e.g., 1 + 0.01 * 10 = 1.1 • Stalls also have implications for ideal number of pipeline stages
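The CPI formula above, 1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …, can be sketched in Python. The function name pipeline_cpi and the list-of-pairs input are assumptions made here for illustration.

```python
def pipeline_cpi(stalls):
    """CPI of a scalar in-order pipeline: 1 plus the summed stall penalties.

    `stalls` is a list of (frequency, stall_cycles) pairs; penalties add
    because stalls generally do not overlap in an in-order pipeline.
    """
    return 1.0 + sum(freq * cycles for freq, cycles in stalls)


# A long penalty that happens rarely costs little: 1 + 0.01 * 10.
print(round(pipeline_cpi([(0.01, 10)]), 2))
```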
Dependences and Hazards
• Dependence: relationship between two insns • Data: two insns use same storage location • Control: one insn affects whether another executes at all • Not a bad thing, programs would be boring without them • Enforced by making older insn go before younger one
• Happens naturally in single-/multi-cycle designs • But not in a pipeline
• Hazard: dependence & possibility of wrong insn order • Effects of wrong insn order cannot be externally visible
• Stall: enforce order by keeping younger insn in same stage • Hazards are a bad thing: stalls reduce performance
Why Does Every Insn Take 5 Cycles?
• Could /should we allow add to skip M and go to W? No – It wouldn’t help: peak fetch still only 1 insn per cycle – Structural hazards: imagine add follows lw
Structural Hazards
• Structural hazards • Two insns trying to use same circuit at same time
• E.g., structural hazard on regfile write port
• To fix structural hazards: proper ISA/pipeline design • Each insn uses every structure exactly once • For at most one cycle • Always at same stage relative to F
Data Hazards
• Let’s forget about branches and the control for a while • The three insn sequence we saw earlier executed fine…
• But it wasn’t a real program • Real programs have data dependences
• They pass values via registers and memory
Data Hazards
• Would this “program” execute correctly on this pipeline? • Which insns would execute with correct inputs? • add is writing its result into $3 in current cycle – lw read $3 2 cycles ago → got wrong value – addi read $3 1 cycle ago → got wrong value • sw is reading $3 this cycle → OK (regfile timing: write first half)
add $3,$2,$1
lw $4,0($3)
addi $6,$3,1
sw $3,0($7)
Memory Data Hazards
• What about data hazards through memory? No • lw following sw to same address in next cycle, gets right value • Why? DMem read/write take place in same stage
• Data hazards through registers? Yes (previous slide) • Occur because register write is 3 stages after register read • Can only read a register value 3 cycles after writing it
sw $5,0($1)
lw $4,0($1)
Fixing Register Data Hazards
• Can only read register value 3 cycles after writing it
• One way to enforce this: make sure programs don’t do it • Compiler puts two independent insns between write/read insn pair
• If they aren’t there already • Independent means: “do not interfere with register in question”
• Do not write it: otherwise meaning of program changes • Do not read it: otherwise create new data hazard
• Code scheduling: compiler moves around existing insns to do this • If none can be found, must use nops
• This is called software interlocks • MIPS: Microprocessor w/out Interlocking Pipeline Stages
Software Interlock Example
add $3,$2,$1
lw $4,0($3)
sw $7,0($3)
add $6,$2,$8
addi $3,$5,4
• Can any of last three insns be scheduled between first two? • sw $7,0($3)? No, creates hazard with add $3,$2,$1 • add $6,$2,$8? OK • addi $3,$5,4? No, lw would read $3 from it • Still need one more insn, use nop
add $3,$2,$1
add $6,$2,$8
nop
lw $4,0($3)
sw $7,0($3)
addi $3,$5,4
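The nop-insertion half of software interlocks can be sketched in Python. This sketch does no code scheduling; it only pads with nops so that two insns separate every register write from a later read. The Insn tuple, the GAP constant, and insert_nops are hypothetical names introduced here, and register fields are filled in by hand rather than parsed.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    text: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


NOP = Insn("nop", None, ())
GAP = 2  # insns required between a register write and a read of it


def insert_nops(program):
    """Pad with nops until every read is at least GAP insns after the write."""
    out = []
    for insn in program:
        needed = 0
        # distance 1 = immediately preceding insn, distance 2 = one before it
        for distance, older in enumerate(reversed(out[-GAP:]), start=1):
            if older.dest is not None and older.dest in insn.srcs:
                needed = max(needed, GAP - distance + 1)
        out.extend([NOP] * needed)
        out.append(insn)
    return out


UNSCHEDULED = [
    Insn("add $3,$2,$1", "$3", ("$2", "$1")),
    Insn("lw $4,0($3)", "$4", ("$3",)),
    Insn("sw $7,0($3)", None, ("$7", "$3")),
    Insn("add $6,$2,$8", "$6", ("$2", "$8")),
    Insn("addi $3,$5,4", "$3", ("$5",)),
]

for insn in insert_nops(UNSCHEDULED):
    print(insn.text)
```

Without scheduling, the example needs two nops between add and lw; moving add $6,$2,$8 into the gap, as on the slide, saves one of them.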
Software Interlock Performance
• Same deal • Branch: 20%, load: 20%, store: 10%, other: 50%
• Software interlocks • 20% of insns require insertion of 1 nop • 5% of insns require insertion of 2 nops
• CPI is still 1 technically • But now there are more insns • #insns = 1 + 0.20*1 + 0.05*2 = 1.3 – 30% more insns (30% slowdown) due to data hazards
Hardware Interlocks
• Problem with software interlocks? Not compatible • Where does 3 in “read register 3 cycles after writing” come from?
• From structure (depth) of pipeline • What if next MIPS version uses a 7 stage pipeline?
• Programs compiled assuming 5 stage pipeline will break
• A better (more compatible) way: hardware interlocks • Processor detects data hazards and fixes them • Two aspects to this
• Detecting hazards • Fixing hazards
Detecting Data Hazards
• Compare F/D insn input register names with output register names of older insns in pipeline
Hazard =
  (F/D.IR.RS1 == D/X.IR.RD) ||
  (F/D.IR.RS2 == D/X.IR.RD) ||
  (F/D.IR.RS1 == X/M.IR.RD) ||
  (F/D.IR.RS2 == X/M.IR.RD)
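The hazard equation can be sketched as a Python predicate. The Latch tuple and the use of None for "no such register" (for example in a nop) are assumptions added here; the slide's equation compares the raw register fields.

```python
from typing import NamedTuple, Optional


class Latch(NamedTuple):
    """Register fields of the insn held in one pipeline latch (IR)."""
    rs1: Optional[int] = None
    rs2: Optional[int] = None
    rd: Optional[int] = None


NOP = Latch()


def data_hazard(fd, dx, xm):
    """True if the F/D insn reads a register that D/X or X/M will write."""
    older_dests = {latch.rd for latch in (dx, xm) if latch.rd is not None}
    sources = {reg for reg in (fd.rs1, fd.rs2) if reg is not None}
    return bool(sources & older_dests)


ADD = Latch(rs1=2, rs2=1, rd=3)   # add $3,$2,$1
LW = Latch(rs1=3, rd=4)           # lw $4,0($3)

print(data_hazard(LW, ADD, NOP))  # add in D/X
print(data_hazard(LW, NOP, ADD))  # add in X/M, inserted nop in D/X
print(data_hazard(LW, NOP, NOP))  # add has moved on to M/W
```

Only D/X and X/M are checked because, with the register file written in the first half of the cycle, an insn in M/W no longer causes a hazard for the insn reading registers.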
Fixing Data Hazards
• Prevent F/D insn from reading (advancing) this cycle • Write nop into D/X.IR (effectively, insert nop in hardware) • Also reset (clear) the datapath control signals • Disable F/D latch and PC write enables (why?)
• Re-evaluate situation next cycle
Aside: Insert NOP/Reset Register
• Earlier: registers support separate clock, write enable • Useful for writes into register file • Also useful for implementing stalls
• Registers should also support synchronous reset (clear) • Useful for implementing stalls • Implement as additional hardwired 0 input to FF data mux • Resetting pipeline registers equivalent to inserting a NOP
• If NOP is all zeros • If zero means “don’t write” for all write-enable control signals • Design ISA/control signals to make sure this is the case
Hardware Interlock Example: cycles 1-3
Hazard = (F/D.IR.RS1 == D/X.IR.RD) || (F/D.IR.RS2 == D/X.IR.RD) || (F/D.IR.RS1 == X/M.IR.RD) || (F/D.IR.RS2 == X/M.IR.RD)
• Cycle 1: Hazard = 1
• Cycle 2: Hazard = 1
• Cycle 3: Hazard = 0
Pipeline Control Terminology
• Hardware interlock maneuver is called stall or bubble
• Mechanism is called stall logic • Part of more general pipeline control mechanism
• Controls advancement of insns through pipeline
• Distinguish from pipelined datapath control • Controls datapath at each stage • Pipeline control controls advancement of datapath control
Pipeline Diagram with Data Hazards
• Data hazard stall indicated with d* • Stall propagates to younger insns
• This is not good (why?)
               1   2   3   4   5   6   7   8   9
add $3,$2,$1   F   D   X   M   W
lw $4,0($3)        F   d*  d*  D   X   M   W
sw $6,4($7)                        F   D   X   M   W
Hardware Interlock Performance
• Same deal • Branch: 20%, load: 20%, store: 10%, other: 50%
• Hardware interlocks: same as software interlocks • 20% of insns require 1 cycle stall (i.e., insertion of 1 nop) • 5% of insns require 2 cycle stall (i.e., insertion of 2 nops)
• CPI = 1 + 0.20*1 + 0.05*2 = 1.3 • So, either CPI stays at 1 and #insns increases 30% (software) • Or, #insns stays at 1 (relative) and CPI increases 30% (hardware) • Same difference
• Anyway, we can do better
Observe
• Technically, this situation is broken • lw $4,0($3) has already read $3 from regfile • add $3,$2,$1 hasn’t yet written $3 to regfile
• But fundamentally, everything is OK • lw $4,0($3) hasn’t actually used $3 yet • add $3,$2,$1 has already computed $3
Bypassing
• Bypassing • Reading a value from an intermediate (µarchitectural) source • Not waiting until it is available from primary source • Here, we are bypassing the register file • Also called forwarding
WX Bypassing
• What about this combination? • Add another bypass path and MUX input • First one was an MX bypass • This one is a WX bypass
ALUinB Bypassing
WM Bypassing?
• Does WM bypassing make sense? • Not to the address input (why not?) • But to the store data input, yes
Bypass Logic
• Each MUX has its own, here it is for MUX ALUinA
  (D/X.IR.RS1 == X/M.IR.RD) => 0
  (D/X.IR.RS1 == M/W.IR.RD) => 1
  Else => 2
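The ALUinA select logic can be sketched as a Python function. The function name and the plain-integer register arguments are assumptions made here; the three cases and their priority order follow the slide.

```python
def aluina_select(dx_rs1, xm_rd, mw_rd):
    """Select value for the ALUinA bypass MUX.

    0 -> take the value from the X/M latch (MX bypass)
    1 -> take the value from the M/W latch (WX bypass)
    2 -> no bypass, use the value read from the register file
    """
    if dx_rs1 == xm_rd:
        return 0
    if dx_rs1 == mw_rd:
        return 1
    return 2
```

The X/M comparison is checked first, so when both older insns write the same register the younger (more recent) value wins.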
Bypass and Stall Logic
• Two separate things • Stall logic controls pipeline registers • Bypass logic controls MUXs
• But complementary • For a given data hazard: if can’t bypass, must stall
• Slide #43 shows full bypassing: all bypasses possible • Is stall logic still necessary?
Yes, Load Output to ALU Input
Stall = (D/X.IR.OP == LOAD) &&
        ((F/D.IR.RS1 == D/X.IR.RD) ||
         ((F/D.IR.RS2 == D/X.IR.RD) && (F/D.IR.OP != STORE)))
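The stall condition can be transcribed into a small Python predicate. This is a sketch, not the course's implementation: the function name `must_stall`, the string opcodes, and the use of `None` for an unused source field are all assumptions made for illustration.

```python
def must_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """Load-use stall check, transcribed from the Stall equation.

    dx_* describe the insn in D/X, fd_* the insn in F/D.
    Pass None for a source field the insn does not use (hypothetical convention).
    """
    if dx_op != "LOAD":
        return False
    uses_as_rs1 = fd_rs1 == dx_rd
    # A store's RS2 is its data input, which WM bypassing can supply in time.
    uses_as_rs2 = fd_rs2 == dx_rd and fd_op != "STORE"
    return uses_as_rs1 or uses_as_rs2


# lw $4,0($3) in D/X, addi $6,$4,1 in F/D: load-use, must stall
print(must_stall("LOAD", 4, "ALU", 4, None))   # True
# lw $4,0($3) in D/X, sw $4,0($5) in F/D: $4 is only the store data
print(must_stall("LOAD", 4, "STORE", 5, 4))    # False
```

The second call shows why the equation excludes stores on RS2: the loaded value is needed only at the data memory input, one stage later than an ALU input would need it.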
Pipeline Diagram With Bypassing
• Use compiler scheduling to reduce load-use stall frequency • Like software interlocks, but for performance not correctness
                1  2  3  4  5  6  7  8  9
add $3,$2,$1    F  D  X  M  W
lw $4,0($3)        F  D  X  M  W
addi $6,$4,1          F  d* D  X  M  W

                1  2  3  4  5  6  7  8  9
add $3,$2,$1    F  D  X  M  W
lw $4,0($3)        F  D  X  M  W
sub $8,$3,$1          F  D  X  M  W
addi $6,$4,1             F  D  X  M  W
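The two diagrams can be checked with a short Python sketch that counts load-use stalls. It assumes a 5-stage in-order pipeline with full bypassing, so the only stall is a one-cycle bubble when an insn needs a loaded value at its ALU input in the very next slot. The tuple encoding `(op, dest, srcs)` and the function names are hypothetical.

```python
def load_use_stalls(insns):
    """Count load-use stall cycles, assuming full bypassing.

    Each insn is (op, dest, srcs); srcs lists the registers needed at the
    ALU input (a store's data register would be left out of srcs).
    """
    stalls = 0
    for prev, cur in zip(insns, insns[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


def total_cycles(insns, stages=5):
    """Cycle in which the last insn finishes W."""
    return len(insns) + (stages - 1) + load_use_stalls(insns)


unscheduled = [("add", 3, (2, 1)), ("lw", 4, (3,)), ("addi", 6, (4,))]
scheduled = [("add", 3, (2, 1)), ("lw", 4, (3,)),
             ("sub", 8, (3, 1)), ("addi", 6, (4,))]
print(total_cycles(unscheduled))  # 8
print(total_cycles(scheduled))    # 8
```

Both sequences finish in cycle 8: the scheduled version fits the independent sub into what was the stall cycle.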
Control Hazards
• Control hazards • Must fetch post-branch insns before branch outcome is known • Default: assume “not-taken” (at fetch, can’t tell it’s a branch)
Branch Recovery
• Branch recovery: what to do when branch is actually taken • Insns that will be written into F/D and D/X are wrong • Flush them, i.e., replace them with nops + They haven’t yet written permanent state (regfile, DMem)
Branch Recovery Pipeline Diagram
• Convention: don’t fill in flushed insns • Taken branch penalty is 2 cycles
                    1  2  3  4  5  6  7  8  9
addi $3,$0,1        F  D  X  M  W
bnez $3,targ           F  D  X  M  W
sw $6,4($7)               F  D
targ: addi $8,$7,1           F
                    1  2  3  4  5  6  7  8  9
addi $3,$0,1        F  D  X  M  W
bnez $3,targ           F  D  X  M  W
targ: addi $8,$7,1              F  D  X  M  W
Branch Performance
• Back of the envelope calculation • Branch: 20%, load: 20%, store: 10%, other: 50% • 75% of branches are taken
• CPI = 1 + 0.20*0.75*2 = 1.3 – Branches cause 30% slowdown • How do we reduce this penalty?
Fast Branch
• Fast branch: can decide at D, not X • Test must be comparison to zero or equality, no time for ALU + New taken branch penalty is 1 – Additional insns (slt) for more complex tests, must bypass to D too • 25% of branches have complex tests that require extra insn • CPI = 1 + 0.20*0.75*1(branch) + 0.20*0.25*1(extra insn) = 1.2
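The two back-of-the-envelope CPI figures (branch resolved at X vs. at D) can be reproduced with a short Python sketch. The function name `branch_cpi` and its parameters are hypothetical; the numbers are the ones from the slides (20% branches, 75% taken, 25% of branches needing one extra insn). Following the slides, the extra insn is charged as added CPI rather than as a higher insn count.

```python
def branch_cpi(branch_frac, taken_frac, taken_penalty,
               complex_frac=0.0, extra_insns=0, base_cpi=1.0):
    """CPI with taken-branch penalties and optional extra compare insns."""
    taken_cost = branch_frac * taken_frac * taken_penalty
    extra_cost = branch_frac * complex_frac * extra_insns
    return base_cpi + taken_cost + extra_cost


resolve_at_x = branch_cpi(0.20, 0.75, taken_penalty=2)
resolve_at_d = branch_cpi(0.20, 0.75, taken_penalty=1,
                          complex_frac=0.25, extra_insns=1)
print(f"{resolve_at_x:.2f} {resolve_at_d:.2f}")  # 1.30 1.20
```

Under these numbers the fast branch still wins even after paying for the extra insns: the taken-branch cost drops by 0.15 while the extra insns add back only 0.05.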
Speculative Execution
• Speculative execution • Execute before all parameters known with certainty • Correct speculation
+ Avoid stall, improve performance • Incorrect speculation (mis-speculation)
– Must abort/flush/squash incorrect insns – Must undo incorrect changes (recover pre-speculation state)
• The “game”: [%correct * gain] – [(1–%correct) * penalty]
• Control speculation: speculation aimed at control hazards • Unknown parameter: are these the correct insns to execute next?
Control Speculation Mechanics
• Guess branch target, start fetching at guessed position • Doing nothing is implicitly guessing target is PC+4 • Can actively guess other targets: dynamic branch prediction
• Execute branch to verify (check) guess • Correct speculation? keep going • Mis-speculation? Flush mis-speculated insns
• Hopefully haven’t modified permanent state (Regfile, DMem) + Happens naturally in in-order 5-stage pipeline
• “Game” for in-order 5 stage pipeline • %correct = ? • Gain = 2 cycles + Penalty = 0 cycles → mis-speculation no worse than stalling
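The speculation "game" is simple enough to write down as a Python sketch. The function name `speculation_payoff` is hypothetical. The slide leaves %correct open; the example below assumes the implicit "not-taken" guess together with the earlier figure that 75% of branches are taken, so 25% of guesses are correct.

```python
def speculation_payoff(p_correct, gain, penalty):
    """Expected cycles saved per speculation, relative to stalling.

    Implements [%correct * gain] - [(1 - %correct) * penalty].
    """
    return p_correct * gain - (1.0 - p_correct) * penalty


# Assumed: always guess "not-taken", 75% of branches taken -> 25% correct.
# Gain = 2 cycles, penalty = 0 cycles, as for the in-order 5-stage pipeline.
print(speculation_payoff(0.25, gain=2, penalty=0))  # 0.5
```

With a penalty of 0 the payoff can never be negative, which is the sense in which mis-speculation is no worse than stalling.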
Dynamic Branch Prediction
• Dynamic branch prediction: guess outcome • Start fetching from guessed address • Flush on mis-prediction (notice new recovery circuit)
Branch Prediction: Short Summary
• Key principle of micro-architecture: • Programs do the same thing over and over (why?)
• Exploit for performance: • Learn what a program did before • Guess that it will do the same thing again
• Details of branch prediction: later (~1 month) • For now, just know it can be done and is important to performance
Branch Prediction Performance
• Dynamic branch prediction • Simple predictor: branches predicted with 75% accuracy • CPI = 1 + 0.20*0.25*2 = 1.1 • More advanced predictor: 95% accuracy • CPI = 1 + 0.20*0.05*2 = 1.02
• Branch mis-predictions still a big problem though • Pipelines are long: typical mis-prediction penalty is 10+ cycles • Pipelines have full bypassing: compiler schedules the rest • Pipelines are superscalar (later)
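To see why mis-predictions remain a problem in long pipelines, the CPI formula can be swept over accuracy and penalty in Python. The function name `predicted_cpi` is hypothetical; the 20% branch frequency and the 75%/95% accuracies come from the slides, and the 10-cycle penalty is taken as the low end of the "10+ cycles" figure.

```python
def predicted_cpi(branch_frac, accuracy, mispredict_penalty, base_cpi=1.0):
    """CPI when only mis-predicted branches pay the penalty."""
    return base_cpi + branch_frac * (1.0 - accuracy) * mispredict_penalty


for accuracy in (0.75, 0.95):
    for penalty in (2, 10):
        cpi = predicted_cpi(0.20, accuracy, penalty)
        print(f"accuracy={accuracy:.2f} penalty={penalty:2d} CPI={cpi:.2f}")
```

With a 10-cycle penalty, the 95%-accurate predictor lands at CPI 1.10, the same as the 75%-accurate predictor with a 2-cycle penalty.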
Pipelining And Exceptions
• Pipelining makes exceptions nasty • 5 insns in pipeline at once • Exception happens, how do you know which insn caused it?
• Exceptions propagate along pipeline in latches • Two exceptions happen, how do you know which one to take first?
• One belonging to oldest insn • When handling exception, have to flush younger insns
• Piggy-back on branch mis-prediction machinery to do this • What about multi-cycle operations?
• Just FYI
Pipeline Depth
• No magic about 5 stages, trend had been to deeper pipelines • 486: 5 stages (50+ gate delays / clock) • Pentium: 7 stages • Pentium II/III: 12 stages • Pentium 4: 22 stages (~10 gate delays / clock) “super-pipelining” • Core1/2: 14 stages
• Increasing pipeline depth + Increases clock frequency (reduces period) – But decreases IPC (increases CPI) • Branch mis-prediction penalty becomes longer • Non-bypassed data hazard stalls become longer • At some point, CPI losses offset clock gains, question is when?