IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
Attributes and Assumptions: Memory Bandwidth
• at least one word/cycle, to fetch 1 instruction/cycle from the I-cache
• 40% of instructions are load/store and require access to the D-cache
Code Characteristics (dynamic)
• loads - 25%
• stores - 15%
• ALU/RR - 40%
• branches - 20%: 1/3 unconditional (always taken); 1/3 conditional taken; 1/3 conditional not taken
More Statistics and Assumptions
Cache Performance
• hit ratio of 100% is assumed in the experiments
• cache latency: I-cache = i; D-cache = d; default: i = d = 1 cycle
Load and Branch Scheduling
loads:
• 25% cannot be scheduled
• 75% can be moved back 1 instruction
branches:
• unconditional - 100% schedulable
• conditional - 50% schedulable
CPI Calculations I
No cache bypass of reg. file, no scheduling of loads or branches
• Load penalty: 2 cycles (0.25 × 2 = 0.5)
• Branch penalty: 2 cycles (0.2 × 2/3 × 2 = 0.27)
• Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no scheduling of loads or branches
• Load penalty: 1 cycle (0.25 × 1 = 0.25)
• Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
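The arithmetic above can be checked with a few lines of Python; the 25% load / 20% branch mix and the 2-cycle penalties are the slide's assumptions, and 2/3 is the fraction of branches that actually redirect the PC:

```python
# CPI Calculations I: slide-assumed mix (25% loads, 20% branches,
# 2/3 of branches taken: unconditional + conditional-taken).
loads, branches = 0.25, 0.20
taken_frac = 2.0 / 3.0

# No bypass: every load pays 2 extra cycles, every taken branch pays 2.
cpi_no_bypass = 1 + loads * 2 + branches * taken_frac * 2
print(round(cpi_no_bypass, 2))   # -> 1.77

# With a cache-to-ALU bypass the load penalty drops to 1 cycle.
cpi_bypass = 1 + loads * 1 + branches * taken_frac * 2
print(round(cpi_bypass, 2))      # -> 1.52
```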
CPI Calculations II
Bypass, scheduling of loads and branches
Load Penalty:
• 75% can be moved back 1 instruction => no penalty
• remaining 25% => 1 cycle penalty (0.25 × 0.25 × 1 = 0.063)
Branch Penalty:
• 1/3 unconditional, 100% schedulable => 1 cycle (0.2 × 1/3 × 1 = 0.067)
• 1/3 conditional not taken, if biased for NT => no penalty
• 1/3 conditional taken:
  50% schedulable => 1 cycle (0.2 × 1/3 × 0.5 × 1 = 0.033)
  50% unschedulable => 2 cycles (0.2 × 1/3 × 0.5 × 2 = 0.067)
Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
CPI Calculations III
Parallel target address generation
• 90% of branches can be coded as PC-relative, i.e. the target address can be computed without register access
• a separate adder can compute (PC + offset) in the decode stage
Branch Penalty (conditional and unconditional):

PC-relative addressing   Schedulable   Branch penalty
YES (90%)                YES (50%)     0 cycles
YES (90%)                NO (50%)      1 cycle
NO (10%)                 YES (50%)     1 cycle
NO (10%)                 NO (50%)      2 cycles

Total CPI: 1 + 0.063 + 0.087 = 1.15 CPI = 0.87 IPC
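A sketch of Calculations II and III in Python, using the same assumed mix; the 0.087 branch penalty for Calculation III is taken directly from the slide rather than re-derived:

```python
# CPI Calculations II and III (assumed mix: 25% loads, 20% branches,
# 1/3 of branches in each class, as in the earlier slides).
loads, branches = 0.25, 0.20
third = 1.0 / 3.0

# II: bypass + scheduling; only the 25% unschedulable loads pay 1 cycle.
load_pen = loads * 0.25 * 1                      # = 0.0625 ~ 0.063
branch_pen = (branches * third * 1               # unconditional, scheduled
              + branches * third * 0.5 * 1       # cond. taken, schedulable
              + branches * third * 0.5 * 2)      # cond. taken, unschedulable
cpi_ii = 1 + load_pen + branch_pen
print(round(cpi_ii, 2))                          # -> 1.23

# III: PC-relative target generation in decode cuts the branch penalty
# to 0.087 (the slide's 0/1/1/2-cycle table), so:
cpi_iii = 1 + 0.063 + 0.087
print(round(cpi_iii, 2), round(1 / cpi_iii, 2))  # -> 1.15 0.87
```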
Pipeline Depth

Processor Performance = Time / Program
  = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
  = (code size) × (CPI) × (cycle time)

[Figure: an unpipelined design takes the total logic delay T (plus latch overhead S) per instruction; a k-stage pipelined design has cycle time T/k + S. How deep should the pipeline be?]
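The trade-off in the figure can be sketched numerically; T = 100 and S = 2 below are illustrative values, with T the total logic delay and S the per-stage latch overhead as in the slide:

```python
# Pipelining model from the slide: total logic delay T split into k stages,
# each stage adding latch overhead S, so cycle time = T/k + S.
def speedup(T, k, S):
    # unpipelined instruction time ~ (T + S); pipelined cycle ~ (T/k + S)
    return (T + S) / (T / k + S)

T, S = 100.0, 2.0                     # assumed example values, arbitrary units
for k in (1, 4, 8, 16):
    print(k, round(speedup(T, k, S), 2))

# As k grows, speedup saturates at (T + S) / S: latch overhead dominates.
print(round((T + S) / S, 1))          # -> 51.0
```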
Limitations of Scalar Pipelines
Upper Bound on Scalar Pipeline Throughput
Limited by IPC = 1
Inefficient Unification Into Single Pipeline
Long latency for each instruction
Performance Lost Due to Rigid Pipeline
Unnecessary stalls
Stalls in an Inorder Scalar Pipeline
[Figure: a stalled instruction blocks the pipeline — bypassing of the stalled instruction is not allowed, and the stall propagates backward through all earlier stages]
Instructions are in order with respect to any one stage, i.e. no dynamic reordering
Architectures for Instruction-Level Parallelism
Scalar Pipeline (baseline)
Instruction Parallelism = D
Operation Latency = 1
Peak IPC = 1
[Figure: baseline scalar pipeline — successive instructions 1-6 each flow through IF, DE, EX, WB, one issued per cycle; time in cycles of the baseline machine, 0-9; D is the pipeline depth]
Superpipelined Machine
Superpipelined Execution
IP = DxM
OL = M minor cycles
Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: superpipelined execution — each of IF, DE, EX, WB is subdivided, one instruction issued per minor cycle; major cycle = M minor cycles]
Superscalar Machines
Superscalar (Pipelined) Execution
IP = DxN
OL = 1 baseline cycle
Peak IPC = N per baseline cycle
[Figure: superscalar execution — N instructions (1-9 shown in groups) enter IF, DE, EX, WB together each baseline cycle]
Superscalar and Superpipelined
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.
Superscalar Parallelism
Operation Latency: 1
Issuing Rate: N
Superscalar Degree (SSD): N
(Determined by Issue Rate)
Superpipeline Parallelism
Operation Latency: M
Issuing Rate: 1
Superpipelined Degree (SPD): M
(Determined by Operation Latency)
[Figure: timing comparison of superpipelined (degree M) and superscalar (degree N) execution over cycles 0-13 of the base machine; key: IFetch, Dcode, Execute, Writeback]
Limitations of Inorder Pipelines
CPI of inorder pipelines degrades very sharply if machine parallelism is increased beyond a certain point, i.e. when N×M approaches the average distance between dependent instructions
Forwarding is no longer effective
must stall more often
Pipeline may never be full due to frequent dependency stalls!!
What is Parallelism?
Work: T1 = time to complete the computation on a sequential system
Critical Path: T∞ = time to complete the same computation on an infinitely-parallel system
Average Parallelism: Pavg = T1 / T∞
For a p-wide system: Tp ≥ max{ T1/p, T∞ }
If Pavg >> p, then Tp ≈ T1/p

Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
[Figure: dataflow graph — a and b feed x = a + b and y = b * 2; x and y feed x - y and x + y, whose product is z]
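The example's work and critical path can be computed from its dataflow graph; the temporary names d and s below are hypothetical, standing in for (x - y) and (x + y):

```python
# T1 (work) and T_inf (critical path) for the slide's example,
# assuming every operation takes one unit of time.
ops = {                      # op -> operands it depends on
    'x': [], 'y': [],        # x = a + b ; y = b * 2  (a, b are inputs)
    'd': ['x', 'y'],         # d = x - y  (hypothetical temp name)
    's': ['x', 'y'],         # s = x + y  (hypothetical temp name)
    'z': ['d', 's'],         # z = d * s
}

T1 = len(ops)                # sequential time = total work = 5 operations

def depth(node):             # longest dependence chain ending at node
    deps = ops[node]
    return 1 + (max(depth(d) for d in deps) if deps else 0)

T_inf = max(depth(n) for n in ops)       # critical path = 3
print(T1, T_inf, round(T1 / T_inf, 2))   # -> 5 3 1.67
```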
ILP: Instruction-Level Parallelism
ILP is a measure of the inter-instruction dependencies in a program
Average ILP = number of instructions / number of cycles required

code1: r1 ← r2 + 1
       r3 ← r1 / 17
       r4 ← r0 - r3
code1: ILP = 1, i.e. must execute serially

code2: r1 ← r2 + 1
       r3 ← r9 / 17
       r4 ← r0 - r10
code2: ILP = 3, i.e. all three can execute at the same time
Inter-instruction Dependences
Data dependence (Read-after-Write, RAW):
  r3 ← r1 op r2
  r5 ← r3 op r4
Anti-dependence (Write-after-Read, WAR):
  r3 ← r1 op r2
  r1 ← r4 op r5
Output dependence (Write-after-Write, WAW):
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
Control dependence
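These three cases can be detected mechanically; a minimal sketch, representing each instruction as a destination register plus a set of source registers:

```python
# Classify the data dependences from an earlier instruction i1 to a later
# instruction i2, each given as (destination register, set of sources).
def dependences(i1, i2):
    d1, s1 = i1
    d2, s2 = i2
    out = []
    if d1 in s2: out.append('RAW')   # i2 reads what i1 writes
    if d2 in s1: out.append('WAR')   # i2 overwrites what i1 reads
    if d1 == d2: out.append('WAW')   # both write the same register
    return out

# r3 <- r1 op r2 followed by r5 <- r3 op r4
print(dependences(('r3', {'r1', 'r2'}), ('r5', {'r3', 'r4'})))  # -> ['RAW']
# r3 <- r1 op r2 followed by r1 <- r4 op r5
print(dependences(('r3', {'r1', 'r2'}), ('r1', {'r4', 'r5'})))  # -> ['WAR']
# r3 <- r1 op r2 followed by r3 <- r6 op r7
print(dependences(('r3', {'r1', 'r2'}), ('r3', {'r6', 'r7'})))  # -> ['WAW']
```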
Scope of ILP Analysis
r1  ← r2 + 1
r3  ← r1 / 17
r4  ← r0 - r3
r11 ← r12 + 1
r13 ← r19 / 17
r14 ← r0 - r20
Within each three-instruction window: ILP = 1; over the whole six-instruction scope: ILP = 2
Out-of-order execution permits more ILP to be exploited
Purported Limits on ILP
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51
Nicolau and Fisher [1984]: 90
Flow Path Model of Superscalars
[Figure: flow path model — FETCH (fed by the I-cache and branch predictor) fills the instruction buffer; DECODE; EXECUTE across integer, floating-point, media, and memory units; COMMIT via the reorder buffer (ROB) and store queue to the D-cache. Three flows: instruction flow, register data flow, memory data flow]
Superscalar Pipeline Design
Fetch
  (Instruction Buffer)
Decode
  (Dispatch Buffer)
Dispatch
  (Issuing Buffer)
Execute
  (Completion Buffer)
Complete
  (Store Buffer)
Retire
Buffers between stages decouple instruction flow (front end) from data flow (back end)
Inorder Pipelines
Intel i486: a single inorder pipeline — IF → D1 → D2 → EX → WB
Intel Pentium: two such pipelines in parallel (U-pipe and V-pipe), each IF → D1 → D2 → EX → WB
Inorder pipeline: no WAW, no WAR (almost always true)
Out-of-order Pipelining 101
[Figure: pipeline with diversified execution — IF, ID, RD in order, then parallel functional units (INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST) in EX, then WB]

Program Order:
  Ia: F1 ← F2 x F3 . . . . . (long latency)
  Ib: F1 ← F4 + F5

What is the value of F1? WAW!!!
With out-of-order WB:
  Ib: F1 ← "F4 + F5"
  . . . . . .
  Ia: F1 ← "F2 x F3"
Output Dependences (WAW)
Superscalar Execution Check List
INSTRUCTION PROCESSING CONSTRAINTS
• Resource contention (structural dependences)
• Code dependences
  - Control dependences
  - Data dependences
    · True dependences (RAW)
    · Storage conflicts: anti-dependences (WAR) and output dependences (WAW)
In-order Issue into Diversified Pipelines
[Figure: inorder instruction stream dispatched into diversified pipelines — INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST]

Instruction format: RD ← Fn(RS, RT), with destination register RD, function unit Fn, and source registers RS and RT
Issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
Simple Scoreboarding
Scoreboard: a bit-array, 1 bit for each GPR
• if the bit is not set, the register has valid data
• if the bit is set, the register has stale data, i.e. some outstanding instruction is going to change it
Dispatch in order: RD ← Fn(RS, RT)
• if SB[RS] or SB[RT] is set => RAW, stall
• if SB[RD] is set => WAW, stall
• else dispatch to FU, set SB[RD]
Complete out-of-order:
• update GPR[RD], clear SB[RD]
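A minimal sketch of this bit-per-register scheme, using i1: FDIV R3, R3, R2 and i4: FDIV R4, R3, R1 from the lecture example as the demo (register numbers stand in for R1-R6):

```python
# One-bit-per-GPR scoreboard: dispatch in order, stall on RAW or WAW,
# out-of-order completion clears the destination's bit.
class SimpleScoreboard:
    def __init__(self, nregs=32):
        self.busy = [False] * nregs     # set => stale, a write is pending

    def try_dispatch(self, rd, rs, rt):
        if self.busy[rs] or self.busy[rt]:   # RAW: a source is not ready
            return False
        if self.busy[rd]:                    # WAW: an earlier write pending
            return False
        self.busy[rd] = True                 # claim the destination
        return True

    def complete(self, rd):                  # out-of-order completion
        self.busy[rd] = False                # GPR[rd] now holds valid data

sb = SimpleScoreboard()
print(sb.try_dispatch(3, 3, 2))   # i1: FDIV R3, R3, R2 -> True
print(sb.try_dispatch(4, 3, 1))   # i4: FDIV R4, R3, R1 -> False (RAW on R3)
sb.complete(3)                    # i1 writes back R3
print(sb.try_dispatch(4, 3, 1))   # -> True once R3 is valid
```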
Scoreboarding Example
FU Status: Int | Fadd | FMult | FDiv | WB     Scoreboard: R0 R1 R2 R3 R4 R5
t0 i1 x
t1 i2 i1 x x
t2
t3
t4
t5
t6
t7
t8
t9
t10
t11
i1: FDIV R3, R3, R2
i2: LD   R1, 0(R6)
i3: FMUL R0, R1, R2
i4: FDIV R4, R3, R1
i5: FSUB R5, R0, R3
i6: FMUL R3, R3, R1
(assume 1 issue/cycle)
Scoreboarding Example
FU Status: Int | Fadd | FMult | FDiv | WB     Scoreboard: R0 R1 R2 R3 R4 R5
t0 i1 i1
t1 i2 i1 i2 i1
t2 i1 i2 i2 i1
t3 i3 i1 i3 i1
t4 i3 i1 i3 i1
t5 i3 i4 i3 i4
t6 i4 i3 i3 i4
t7 i5 i4 i4 i5
t8 i6 i4 i5 i6 i4 i5
t9 i6 i4, i6 i4
t10 i6 i6
t11 i6 i6
Can WAW really happen here? Can we go to multiple issue? Can we go to out-of-order issue?
i1: FDIV R3, R3, R2
i2: LD   R1, 0(R6)
i3: FMUL R0, R1, R2
i4: FDIV R4, R3, R1
i5: FSUB R5, R0, R3
i6: FMUL R3, R3, R1
Out-of-Order Issue
[Figure: out-of-order issue pipeline — inorder instruction stream through IF, ID, RD, then Dispatch into per-unit buffers in front of INT, Fadd, Fmult, LD/ST; EX and WB proceed out of order]
Scoreboarding for Out-of-Order Issue
Scoreboard: one entry per GPR
(what do we need to record?)
Dispatch in order: "RD ← Fn(RS, RT)"
• if FU is busy => structural hazard, stall
• if SB[RD] is set => WAW, stall
• if SB[RS] or SB[RT] is set => RAW (what to do??)
Issue out-of-order: (when?)
Complete out-of-order:
• update GPR[RD], clear SB[RD]
• (what about WAR?)
Scoreboard for Out-of-Order Issue [H&P pp242~251]
Function Unit Status
Name      Busy   Op   Fi   Fj   Fk   Qj   Qk   Rj    Rk
Integer          Fn   RD   RS   RT             Yes   No
FAdd
FMult
LD/ST

Register Result Status (a.k.a. Scoreboard)
      R0  R1  R2  R3  R4  R5  R6  . . .
FU

For an instruction "RD ← Fn(RS, RT)":
• Qj, Qk: which FU is computing the new value of a source operand, if it is not ready
• Register result status: which FU is going to update each register
Scoreboard Management: "RD ← Fn(RS, RT)"

Dispatch
  wait until: not Busy(FU) and not Result('RD')
  bookkeeping: Busy(FU) ← yes; Op(FU) ← Fn; Fi(FU) ← 'RD'; Fj(FU) ← 'RS'; Fk(FU) ← 'RT'; Qj(FU) ← Result('RS'); Qk(FU) ← Result('RT'); Rj(FU) ← not Qj(FU); Rk(FU) ← not Qk(FU); Result('RD') ← FU
Issue (read operands)
  wait until: Rj(FU) and Rk(FU)
  bookkeeping: Rj(FU) ← no; Rk(FU) ← no; Qj(FU) ← 0; Qk(FU) ← 0
Execution complete
  wait until: functional unit done
Write result
  wait until: ∀f (( Fj(f) ≠ Fi(FU) or Rj(f) = No ) and ( Fk(f) ≠ Fi(FU) or Rk(f) = No ))
  bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← yes); ∀f (if Qk(f) = FU then Rk(f) ← yes); Result(Fi(FU)) ← 0; Busy(FU) ← no

Legend: FU — the fxn unit used by the instruction; Fj(X) — content of entry Fj for fxn unit X; Result(X) — register result status entry for register X
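The bookkeeping above can be sketched in Python. This is a simplified model (one instruction per FU, no cycle timing), with field names following the H&P table; the demo registers F20/F21 are illustrative:

```python
# Sketch of the H&P scoreboard bookkeeping: Fi/Fj/Fk are destination and
# source registers, Qj/Qk the producing FUs, Rj/Rk "operand ready" flags.
class FUEntry:
    def __init__(self):
        self.busy = False

result = {}            # register -> FU that will write it (result status)

def dispatch(fu, op, rd, rs, rt):
    if fu.busy or rd in result:                     # structural or WAW hazard
        return False
    fu.busy, fu.op = True, op
    fu.Fi, fu.Fj, fu.Fk = rd, rs, rt
    fu.Qj, fu.Qk = result.get(rs), result.get(rt)   # producing FUs, if any
    fu.Rj, fu.Rk = fu.Qj is None, fu.Qk is None     # ready iff no producer
    result[rd] = fu
    return True

def read_operands(fu):                  # the "Issue" step of the table
    if fu.Rj and fu.Rk:
        fu.Rj = fu.Rk = False           # operands latched
        fu.Qj = fu.Qk = None
        return True
    return False

def write_result(fu, all_fus):
    # WAR check: stall if any FU still waits to *read* the old value of Fi
    for f in all_fus:
        if f.busy and ((f.Fj == fu.Fi and f.Rj) or (f.Fk == fu.Fi and f.Rk)):
            return False
    for f in all_fus:                   # wake up consumers waiting on fu
        if f.busy and f.Qj is fu: f.Rj, f.Qj = True, None
        if f.busy and f.Qk is fu: f.Rk, f.Qk = True, None
    del result[fu.Fi]
    fu.busy = False
    return True

int_fu, mult = FUEntry(), FUEntry()
print(dispatch(int_fu, 'LD', 'F2', 'R3', 'R3'))    # True
print(dispatch(mult, 'MULTD', 'F0', 'F2', 'F4'))   # True; Qj = the load's FU
print(read_operands(mult))                         # False: F2 not ready (RAW)
read_operands(int_fu)                              # the load reads its operand
write_result(int_fu, [int_fu, mult])               # load writes F2, wakes mult
print(read_operands(mult))                         # True: both operands ready
```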
Scoreboarding Example 1/3

Instruction Status
Instruction           Dispatch  Read Operands  Execution Complete  Write Result
LD    F6, 43(R2)      X         X              X                   X
LD    F2, 45(R3)      X         X              X
MULTD F0, F2, F4      X
SUBD  F8, F6, F2      X
DIVD  F10, F0, F6     X
ADDD  F6, F8, F2

Function Unit Status
Name        Busy  Op     Fi   Fj  Fk  Qj       Qk       Rj   Rk
Integer(1)  Yes   LD     F2   R3                        No
Mult1(10)   Yes   MULTD  F0   F2  F4  Integer           No   Yes
Mult2(10)   No
Add(2)      Yes   SUBD   F8   F6  F2           Integer  Yes  No
Div(40)     Yes   DIVD   F10  F0  F6  Mult1             No   Yes

Register Result Status (a.k.a. Scoreboard)
     F0     F2       F4   F6   F8   F10     F12  . . .
FU   Mult1  Integer            Add  Divide
Scoreboarding Example 2/3

Instruction Status
Instruction           Dispatch  Read Operands  Execution Complete  Write Result
LD    F6, 43(R2)      X         X              X                   X
LD    F2, 45(R3)      X         X              X                   X
MULTD F0, F2, F4      X         X              X
SUBD  F8, F6, F2      X         X              X                   X
DIVD  F10, F0, F6     X
ADDD  F6, F8, F2      X         X              X

Function Unit Status
Name        Busy  Op     Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer(1)  No
Mult1(10)   Yes   MULTD  F0   F2  F4             No  No
Mult2(10)   No
Add(2)      Yes   ADDD   F6   F8  F2             No  No
Div(40)     Yes   DIVD   F10  F0  F6  Mult1      No  Yes

Register Result Status (a.k.a. Scoreboard)
     F0     F2   F4   F6   F8   F10     F12  . . .
FU   Mult1            Add       Divide
Scoreboarding Example 3/3

Instruction Status
Instruction           Dispatch  Read Operands  Execution Complete  Write Result
LD    F6, 43(R2)      X         X              X                   X
LD    F2, 45(R3)      X         X              X                   X
MULTD F0, F2, F4      X         X              X                   X
SUBD  F8, F6, F2      X         X              X                   X
DIVD  F10, F0, F6     X         X              X
ADDD  F6, F8, F2      X         X              X                   X

Function Unit Status
Name        Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
Integer(1)  No
Mult1(10)   No
Mult2(10)   No
Add(2)      No
Div(40)     Yes   DIVD  F10  F0  F6          No  No

Register Result Status (a.k.a. Scoreboard)
     F0  F2  F4  F6  F8  F10     F12  . . .
FU                       Divide
Limitations of Scoreboarding
Consider a scoreboard processor with an infinitely wide datapath
In the best case, how many instructions can be simultaneously outstanding?
Hints:
• no structural hazards
• one can always write a RAW-free code sequence:
  addi r1, r0, 1; addi r2, r0, 1; addi r3, r0, 1; . . .
• think about the x86 ISA with only 8 registers
Contribution to Register Recycling
$34: mul  $14, $7, 40
     addu $15, $4, $14
     mul  $24, $9, 4
     addu $25, $15, $24
     lw   $11, 0($25)
     mul  $12, $9, 40
     addu $13, $5, $12
     mul  $14, $8, 4
     addu $15, $13, $14
     lw   $24, 0($15)
     mul  $25, $11, $24
     addu $10, $10, $25
     addu $9, $9, 1
     ble  $9, 10, $34
COMPILER REGISTER ALLOCATION
• Code generation: single assignment, symbolic registers
• Register allocation: map symbolic registers to physical registers; maximize reuse of registers
INSTRUCTION LOOPS
• reuse the same set of registers in each iteration
• this conflicts with overlapped execution of different iterations
for (k = 1; k <= 10; k++) t += a[i][k] * b[k][j];
Resolving False Dependences
WAR example — must prevent (2) from completing before (1) is dispatched:
  (1) R4 ← R3 + 1
  (2) R3 ← R5 + 1
WAW example — must prevent (2) from completing before (1) completes:
  (1) R3 ← R3 op R5
  (2) R3 ← R5 + 1
Remedies:
• Stalling: delay dispatch (or write back) of the 2nd instruction
• Copy operands: copy a not-yet-used operand to prevent it being overwritten (WAR)
• Register renaming: use a different register (WAW & WAR)
Register Renaming
Anti- and output dependences are false dependences
• the dependence is on a name/location rather than on data
• given an infinite number of registers, anti- and output dependences can always be eliminated

  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7

Original           Renamed
r1 ← r2 / r3       r1 ← r2 / r3
r4 ← r1 * r5       r4 ← r1 * r5
r1 ← r3 + r6       r8 ← r3 + r6
r3 ← r1 - r4       r9 ← r8 - r4

[Figure: a rename table maps ISA register names into a rename register file (t0 ... t63)]
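The Original → Renamed transformation above can be sketched as a single pass over the code; the free-register pool (r8, r9, ...) and the tuple encoding of instructions are illustrative:

```python
# Eliminate WAR/WAW by giving each re-written destination a fresh register,
# reproducing the slide's Original -> Renamed example.
def rename(code, first_free):
    table, seen = {}, set()   # ISA name -> current name; names used so far
    out, free = [], first_free
    for dst, lhs, op, rhs in code:
        lhs_m = table.get(lhs, lhs)        # read sources through the table
        rhs_m = table.get(rhs, rhs)
        seen.update((lhs, rhs))
        if dst in seen:                    # reuse of a live name: WAR or WAW
            table[dst] = f'r{free}'        # bind it to a fresh register
            free += 1
        seen.add(dst)
        out.append((table.get(dst, dst), lhs_m, op, rhs_m))
    return out

original = [('r1', 'r2', '/', 'r3'),
            ('r4', 'r1', '*', 'r5'),
            ('r1', 'r3', '+', 'r6'),
            ('r3', 'r1', '-', 'r4')]
for dst, lhs, op, rhs in rename(original, first_free=8):
    print(f'{dst} <- {lhs} {op} {rhs}')
```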
Hardware Register Renaming
Maintain bindings from ISA register names to rename registers
When issuing an instruction that updates 'RD':
• allocate an unused rename register Tx
• record the binding from 'RD' to Tx
When to remove a binding? When to de-allocate a rename register?
Example: ISA name R12 → rename register T56
  R1 ← R2 / R3
  R4 ← R1 * R5
  R1 ← R3 + R6
To be continued next lecture!!