12.1
CS356 Unit 12b
Advanced Processor Organization
12.2
Goals
• Understand the terms and ideas used in a modern, high-performance processor
• Different systems use different kinds of processors; you should understand the pros and cons of each kind
• Terms to listen for and understand: superscalar/multiple issue, loop unrolling, register renaming, out-of-order execution, speculation, and branch prediction
12.3
A New Instruction
• In x86, we often perform a compare followed by a conditional jump:
– cmp %rax, %rbx
– je L1 or jne L1
• Many instruction sets have a single instruction that both compares and jumps (limited to registers only):
– je %rax, %rbx, L1
– jne %rax, %rbx, L1
• Let us assume x86 supports such an instruction in our subsequent discussion
12.4
INSTRUCTION LEVEL PARALLELISM
12.5
Have We Hit the Limit?
• Under ideal circumstances, pipelining would allow us to achieve a throughput (IPC = instructions per clock) of 1
• Can we do better? Can we execute more than one instruction per clock?
– Not with a single pipeline
– But what if we had multiple "pipelines"?
– What if we fetched multiple instructions per clock and let them run down the pipeline in parallel?
• Let's exploit parallelism!
12.6
Exploiting Parallelism
• With the increasing transistor budgets of modern processors (i.e., the ability to do more things at the same time), the question becomes how to find enough useful work to increase performance; put another way, what is the most effective way of exploiting parallelism?
• Many types of parallelism are available:
– Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
– Thread Level Parallelism (TLP): Overlapping execution of multiple processes/threads
– Data Level Parallelism (DLP): Overlapping an operation (instruction) that is to be applied independently to multiple data values (usually an array):
for(i=0; i < MAX; i++) { A[i] = A[i] + 5; }
• We'll focus on ILP in this unit
12.7
ld 0(%r8), %r9
and %r10, %r11
or %r11, %r13
sub %r14, %r15
add %r10, %r12
je $0, %r12, L1
xor %r15, %rax
Instruction Level Parallelism (ILP)
• Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.
• ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel
• Data flow (data dependencies) is what truly limits ordering
– We call these dependencies RAW (Read-After-Write) hazards
• Independent instructions can be parallelized
• Control hazards also provide ordering constraints
[Figure: dependency graph and a cycle-by-cycle schedule for the code above. LD, AND, SUB, and ADD have no incoming edges and can execute together; AND writes %r11, which OR reads; SUB writes %r15, which XOR reads; ADD writes %r12, which JE reads. Independent instructions can be issued in the same cycle (cycles 1-3).]
12.8
Basic Blocks
• Basic Block (def.) = sequence of instructions that will always be executed together
– No conditional branches out
– No branch targets coming in
– Also called "straight-line" code
– Average size: 5-7 instructions
• Instructions in a basic block can be overlapped if there are no data dependencies
• Control dependencies really limit our window of possible instructions to overlap
– Without extra hardware, we can only overlap execution of instructions within a basic block
This is a basic block (starts w/ target, ends with branch)
ld 0(%r8),%r9
and %r10,%r11
L1: add %r8,%r12
or %r11,%r13
sub %r14,%r10
jeq %r12,%r14,L1
xor %r10,%r15
12.9
Superscalar
• When airplanes broke the sound barrier we said they were supersonic
• When a processor (HW) can complete more than 1 instruction per clock cycle we say it is superscalar
• Problem: the HW can execute 2 or more instructions during the same cycle, but the SW may be written and compiled assuming 1 instruction executes at a time
• Solutions
– Static scheduling: recompile the code and rely on the compiler to safely order instructions that can be run in parallel
– Dynamic scheduling: build the HW to be smart and reorder instructions on the fly while guaranteeing correctness
12.10
Superscalar (Multiple Issue)
• Multiple "pipelines" that can fetch, decode, and potentially execute more than 1 instruction per clock
– k-way superscalar: ability to complete up to k instructions per clock cycle
• Benefits
– Theoretical throughput greater than 1 (IPC > 1)
• Problems
– Hazards
• Dependencies between instructions limit parallelism
• A branch/jump requires flushing all pipelines
– Finding enough parallel instructions
12.11
Data Flow and Dependency Graphs
• The compiler produces a sequential order of instructions
• Modern processors will transform the sequential order to execute instructions in parallel
• Instructions can be executed in any valid topological ordering of the dependency graph
[Figure: dependency graph for the code below. LD feeds AND (%r9), AND feeds OR (%r11), SUB feeds XOR (%r15), ADD feeds JE (%r12).]

ld 0(%r8), %r9
and %r9, %r11
or %r11, %r13
sub %r14, %r15
add %r10, %r12
je $0,%r12,L1
xor %r15, %r9
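The claim that instructions can execute in any valid topological ordering of the dependency graph can be made concrete. The following C sketch (the encoding and names are my own, not from the slides) records the RAW edges of a 7-instruction example and emits one valid order using Kahn's algorithm:

```c
/* Dependency graph with RAW edges LD->AND, AND->OR, SUB->XOR, ADD->JE.
   Any topological order of this graph is a valid execution order;
   instructions whose in-degree is 0 at the same time could issue in
   the same cycle on a wide-enough machine. */
enum { LD, AND, OR, SUB, ADD, JE, XOR, N_INSTR };

/* Kahn's algorithm: emit instructions whose producers have all been
   emitted. Returns how many were emitted (N_INSTR if acyclic). */
int topo_order(int out[N_INSTR]) {
    int edge[N_INSTR][N_INSTR] = {{0}};
    edge[LD][AND] = edge[AND][OR] = 1;
    edge[SUB][XOR] = edge[ADD][JE] = 1;

    int indeg[N_INSTR] = {0}, done[N_INSTR] = {0}, emitted = 0;
    for (int u = 0; u < N_INSTR; u++)
        for (int v = 0; v < N_INSTR; v++)
            if (edge[u][v]) indeg[v]++;

    int progress = 1;
    while (progress) {
        progress = 0;
        for (int u = 0; u < N_INSTR; u++) {
            if (!done[u] && indeg[u] == 0) {
                out[emitted++] = u;     /* all producers already emitted */
                done[u] = 1;
                progress = 1;
                for (int v = 0; v < N_INSTR; v++)
                    if (edge[u][v]) indeg[v]--;
            }
        }
    }
    return emitted;
}
```

A hardware scheduler effectively discovers the same graph at runtime; the compiler does it statically.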
12.12
STATIC MULTIPLE ISSUE MACHINES
Compiler-based solutions
12.13
Static Multiple Issue
• Compiler is responsible for finding and packaging instructions that can execute in parallel into issue packets
– Only certain combinations of instructions can be in a packet together
– Instruction packet example with 3 instructions/packet:
• (1) Integer/branch instruction slot
• (1) LD/ST instruction
• (1) FP operation
• An issue packet is often thought of as one LONG instruction containing multiple instructions (a.k.a. Very Long Instruction Word, VLIW)
– Intel's Itanium used this technique (static multiple issue) but called it EPIC (Explicitly Parallel Instruction Computing)
12.14
Example 2-way VLIW machine
• One issue slot for INT/BRANCH operations & another for LD/ST instructions
• I-Cache reads out an entire issue packet (more than 1 instruction)
• HW is added to allow many registers to be accessed at one time – Just more multiplexers
• Address Calculation Unit (just a simple adder)
[Figure: 2-way VLIW datapath. The I-Cache reads out an entire issue packet (integer slot + LD/ST slot); a register file with 4 read and 2 write ports feeds an ALU for the integer slot and an address-calculation adder plus D-Cache for the LD/ST slot.]

Example issue packet:
INT/BRANCH: add %rcx,%rax
LD/ST:      ld 8(%rdi),%rdx
12.15
2-way VLIW Scheduling
1. No forwarding w/in an issue packet (between instructions in a packet)
2. Full forwarding to previous instructions
– Those behind in the pipeline
3. Still 1 stall cycle necessary when a LD is followed by a dependent instruction
[Figure: 2-way VLIW pipeline with example issue packets illustrating rules 1-3 above:
  add %rcx,%rax | st %rax,0(%rdi)
  sub %rax,%rbx | ld 0(%rdi),%rcx
  or %rcx,%rdx  |
  or %rcx,%rdx  | ld 0(%rdi),%rcx ]
12.16
Sample Scheduling
• Schedule the following loop body on our 2-way static issue machine
# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    add $4,%rdi
    add $-1,%esi
    jne $0,%esi,L1
void f1(int *A, int n) {
  for( ; n != 0; n--, A++)
    *A += 5;
}
Int./Branch Slot | LD/ST Slot
(nop)            | ld 0(%rdi),%r9
add $-1,%esi     | (nop)
add $5,%r9       | (nop)
add $4,%rdi      | st %r9,0(%rdi)
jne $0,%esi,L1   | (nop)
w/o modifying the original code but with code movement: IPC = 6 instructions / 5 cycles = 1.2

Int./Branch Slot | LD/ST Slot
add $-1,%esi     | ld 0(%rdi),%r9
add $4,%rdi      | (nop)
add $5,%r9       | (nop)
jne $0,%esi,L1   | st %r9,-4(%rdi)
w/ modifications and code movement: IPC = 6 instructions / 4 cycles = 1.5
12.17
Annotated Example
[Figure: the 2-way VLIW datapath from slide 12.14 annotated with the final schedule flowing through it:]

Int./Branch Slot | LD/ST Slot
nop              | ld 0(%rdi),%r9
add $-1,%esi     | nop
add $5,%r9       | nop
add $4,%rdi      | st %r9,0(%rdi)
jne $0,%esi,L1   | nop
12.18
Loop Unrolling
• Often there is not enough ILP w/in a single iteration (body) of a loop
• However, different iterations of the loop are often independent and can thus be run in parallel
• This parallelism can be exposed in static issue machines via loop unrolling
– Copy the body of the loop k times and iterate only n / k times
– Instructions from different body iterations can be run in parallel
void f1(int *A, int n) {
  for( ; n != 0; n--, A++)
    *A += 5;
}

// Loop unrolled 4 times
void f1(int *A, int n) {
  // assume n is a multiple of 4
  for( ; n != 0; n-=4, A+=4) {
    *A += 5;
    *(A+1) += 5;
    *(A+2) += 5;
    *(A+3) += 5;
  }
}
12.19
Loop Unrolling
Original Code:

void f1(int *A, int n) {
  for( ; n != 0; n--, A++)
    *A += 5;
}

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    add $4,%rdi
    add $-1,%esi
    jne $0,%esi,L1

Unrolled Code:

// Loop unrolled 4 times
for(i=0; i < MAX; i+=4){
  A[i] = A[i] + 5;
  A[i+1] = A[i+1] + 5;
  A[i+2] = A[i+2] + 5;
  A[i+3] = A[i+3] + 5;
}

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    ld 4(%rdi),%r9
    add $5,%r9
    st %r9,4(%rdi)
    ld 8(%rdi),%r9
    add $5,%r9
    st %r9,8(%rdi)
    ld 12(%rdi),%r9
    add $5,%r9
    st %r9,12(%rdi)
    add $16,%rdi
    add $-4,%esi
    jne $0,%esi,L1

A side effect of unrolling is the reduction of overhead instructions (fewer branches and counter/pointer updates).
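The slides assume n is a multiple of 4. As a sketch of what a compiler typically generates in the general case, the unrolled body is followed by a remainder ("epilogue") loop; the function name here is illustrative:

```c
/* Loop from the slides, unrolled 4x, with a remainder loop so that
   n need not be a multiple of 4. */
void f1_unrolled(int *A, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* main unrolled body */
        A[i]   += 5;
        A[i+1] += 5;
        A[i+2] += 5;
        A[i+3] += 5;
    }
    for (; i < n; i++)             /* 0-3 leftover iterations */
        A[i] += 5;
}
```

The epilogue costs a few extra branches at the end but preserves correctness for arbitrary n.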
12.20
Code Movement & Data Hazards
• To effectively schedule the code, the compiler will often move code up or down but must take care not to change the intended program behavior
• Must deal with WAR (Write-After-Read) and WAW (Write-After-Write) hazards in addition to RAW hazards when moving code
– WAW and WAR hazards are not TRUE hazards (no data communication between instructions) but simply conflicts arising because we want to use the same register; we call them name dependencies or antidependencies!
– How can we solve them? Register renaming.
Original (WAR example):
L1: add %r8, %r9
    add %r9, %r10     # read %r9
    ld 0(%r11), %r9   # write %r9
The LD instruction is REALLY independent (only needs %r11). Could the LD be moved between the 2 adds? Not as is: WAR hazard.
Proposed (wrong %r9 used):
L1: add %r8, %r9
    ld 0(%r11), %r9
    add %r9, %r10

Original (WAW example):
L1: add %r8, %r9      # write %r9
    add %r9, %r10
    ld 0(%r11), %r9   # write %r9
    sub %r9, %r12
Could the LD instruction be run in parallel with or before the first add? Not as is: WAW hazard.
Proposed (wrong %r9 used):
L1: ld 0(%r11), %r9
    add %r8, %r9
    add %r9, %r10
    sub %r9, %r12
12.21
Register Renaming
• Unrolling is not enough: even though each iteration is independent, there are conflicts in the use of registers (%r9 in this case)
– Can't move another 'ld' instruction up until the 'st' is complete due to a WAR hazard, even though there is no true data dependence
• Since there is no true dependence (the ld does not need data from the 'st' or 'add' above), we can solve the problem by register renaming
• Register renaming: using different registers to solve WAR / WAW hazards
# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)     # read %r9
    ld 4(%rdi),%r9     # write %r9: WAR/WAW with the instructions above
    add $5,%r9
    st %r9,4(%rdi)
    ld 8(%rdi),%r9
    add $5,%r9
    st %r9,8(%rdi)
    ld 12(%rdi),%r9
    add $5,%r9
    st %r9,12(%rdi)
    add $16,%rdi
    add $-4,%esi
    jne $0,%esi,L1
12.22
Scheduling w/ Unrolling & Renaming
• Schedule the following loop body on our 2-way static issue machine
# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    ld 4(%rdi),%r10
    add $5,%r10
    st %r10,4(%rdi)
    ld 8(%rdi),%r11
    add $5,%r11
    st %r11,8(%rdi)
    ld 12(%rdi),%r12
    add $5,%r12
    st %r12,12(%rdi)
    add $16,%rdi
    add $-4,%esi
    jne $0,%esi,L1
Int./Branch Slot | LD/ST Slot
(nop)            | ld 0(%rdi),%r9
add $-4,%esi     | ld 4(%rdi),%r10
add $5,%r9       | ld 8(%rdi),%r11
add $5,%r10      | ld 12(%rdi),%r12
add $5,%r11      | st %r9,0(%rdi)
add $5,%r12      | st %r10,4(%rdi)
add $16,%rdi     | st %r11,8(%rdi)
jne $0,%esi,L1   | st %r12,-4(%rdi)
w/ loop unrolling and register renaming (notice how the compiler would have to modify the code to effectively reschedule)
IPC = 15 instructions / 8 cycles = 1.875
12.23
Scheduling w/ Unrolling & Renaming
• Schedule the following loop body on our 2-way static issue machine
# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    ld 4(%rdi),%r10
    add $5,%r10
    st %r10,4(%rdi)
    ld 8(%rdi),%r11
    add $5,%r11
    st %r11,8(%rdi)
    ld 12(%rdi),%r12
    add $5,%r12
    st %r12,12(%rdi)
    add $16,%rdi
    add $-4,%esi
    jne $0,%esi,L1

Int./Branch Slot | LD/ST Slot
12.24
Data Dependency Hazards Summary
• RAW = only real data dependency
– Must be respected in terms of code movement and ordering
– Forwarding reduces the latency of dependent instructions
• WAW and WAR hazards = anti-dependencies
– Solved by using register renaming
• RAR = no issues / dependencies
12.25
Loop Unrolling / Register Renaming
• Loop unrolling increases code size (memory needed to store instructions)
• Register renaming burns more registers and thus may require HW designers to add more registers
• Must have some independence between loop bodies
– A dependence between iterations is known as a loop-carried dependence

// Dependence between iterations
A[0] = 5;
for (i=1; i < MAX; i++) {
  A[i] = A[i-1] + 5;
}
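For this particular recurrence, the loop-carried dependence can be removed by rewriting it in closed form. This is a hand transformation specific to A[i] = A[i-1] + 5, not something unrolling alone can achieve; a sketch:

```c
/* The loop-carried recurrence A[i] = A[i-1] + 5 serializes iterations:
   each one reads the value the previous one wrote. */
void serial(int *A, int n) {
    for (int i = 1; i < n; i++)
        A[i] = A[i-1] + 5;        /* RAW dependence across iterations */
}

/* Because each A[i] is just A[0] + 5*i, the recurrence has a closed
   form; after the rewrite, every iteration is independent and the loop
   can be unrolled or parallelized freely. */
void parallelizable(int *A, int n) {
    int a0 = A[0];
    for (int i = 1; i < n; i++)
        A[i] = a0 + 5 * i;        /* iterations now independent */
}
```

In general, only some recurrences admit such rewrites; most loop-carried dependences genuinely limit parallelism.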
12.26
Memory Hazard Issue
• Suppose %rsi and %rdi are passed in as arguments to a function; can we reorder the instructions below?
– No: if %rsi == %rdi we have a data dependency through memory
• Data dependencies can occur via memory and are harder for the compiler to find at compile time, forcing it to be conservative

ld 0(%rsi),%edx
addl $5,%edx
st %edx,0(%rsi)

ld 0(%rdi),%eax
addl $1,%eax
st %eax,0(%rdi)

Can we reorder these instructions?

Int./Branch Slot | LD/ST Slot
(nop)            | ld 0(%rsi),%edx
(nop)            | ld 0(%rdi),%eax
addl $5,%edx     | (nop)
addl $1,%eax     | st %edx,0(%rsi)
(nop)            | st %eax,0(%rdi)

We moved the 2nd 'ld' up to enhance performance... Is it equivalent? No: we need to wait for the 'st'!!
12.27
Memory Disambiguation
• Data dependencies occur in memory and not just registers!
• Memory RAW dependencies are also made harder by the different ways of addressing the same memory location
– Can the following be reordered?
st %eax, 4(%rdi)
ld -12(%rsi), %ecx
– No! What if %rsi = %rdi + 16?
• Memory disambiguation refers to the process of determining whether a sequence of stores and loads references the same address (ordering often needs to be maintained)
• We can only reorder LD and ST instructions if we can disambiguate their addresses to determine any RAW, WAR, WAW hazards
– LD -> LD is always fine (RAR)
– ST -> LD (RAW), LD -> ST (WAR), and ST -> ST (WAW) are hazards that need to be disambiguated
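One way the programmer can help the compiler disambiguate is C99's restrict qualifier, which promises that the pointers do not alias; the function names below are illustrative:

```c
/* Without alias information, the compiler must assume a and b may
   point to the same location, so the load of *b cannot be moved
   above the store to *a. */
void maybe_alias(int *a, int *b) {
    *a += 5;      /* ld/st through a */
    *b += 1;      /* ld of *b must stay after the store to *a */
}

/* With C99 restrict, the programmer promises a and b never alias
   during this call, so the compiler is free to hoist the load of *b
   above the store to *a when scheduling. */
void no_alias(int *restrict a, int *restrict b) {
    *a += 5;
    *b += 1;
}
```

Note that calling no_alias with aliasing pointers is undefined behavior; the promise shifts the disambiguation burden from the compiler to the programmer.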
12.28
Itanium 2 Case Study
• Max 6 instruction issues/clock
• 6 integer units, 4 memory units, 3 branch units, 2 FP units
– Although full utilization is rare
• Registers
– (128) 64-bit GPRs
– (128) FPRs
• On-chip L3 cache (12 MB of cache memory)
I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the "Itanium" approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.
Don Knuth (https://www.informit.com/articles/article.aspx?p=1193856)
12.29
Static Multiple Issue Summary
• Compiler is in charge of reordering, renaming, and unrolling the original program code to achieve better performance
• Processor is designed to fetch/decode/execute multiple instructions per cycle in the order determined by the compiler
• Pros: HW can be simpler and thus faster/smaller
– More cores
– Potentially higher clock rates
• Cons: requires recompilation
– No support for legacy software
Exercise: 2-way VLIW Scheduling
void f1(int *A, int *B, int N) {
do {
int temp = *A;
*A = temp + *B + 9;
*B = temp;
A--; B--; N--;
} while (N != 0);
}
.L1:
ld (%rdi),%eax ; load temp=*A
ld (%rsi),%ebx ; load *B
add %eax,%ebx ; add temp+*B
add $9,%ebx ; add 9
st %ebx,(%rdi) ; store *A
st %eax,(%rsi) ; store *B
add $-4,%rdi ; dec. A ptr.
add $-4,%rsi ; dec. B ptr.
add $-1,%edx
jne $0,%edx,.L1 ; loop
=== INTEGER SLOT ===
// nop
add %eax,%ebx
add $9,%ebx
add $-4,%rdi
add $-4,%rsi
add $-1,%edx
jne $0,%edx,.L1
=== LD/ST SLOT ===
ld (%rdi),%eax
ld (%rsi),%ebx
st %ebx,(%rdi)
st %eax,(%rsi)
Unoptimized Schedule
Exercise 1: You can move or modify code, but cannot apply loop unrolling or register renaming.
Solution: 2-way VLIW Scheduling
=== INTEGER SLOT ===
// nop
add %eax,%ebx
add $9,%ebx
add $-4,%rdi
add $-4,%rsi
add $-1,%edx
jne $0,%edx,.L1
=== LD/ST SLOT ===
ld (%rdi),%eax
ld (%rsi),%ebx
st %ebx,(%rdi)
st %eax,(%rsi)
Unoptimized Schedule

=== INTEGER SLOT ===
add $-4,%rdi
add $-4,%rsi
add $-1,%edx
add %eax,%ebx
add $9,%ebx
jne $0,%edx,.L1
=== LD/ST SLOT ===
ld (%rdi),%eax
ld (%rsi),%ebx
st %eax,4(%rsi)
st %ebx,4(%rdi)
Move Up and Modify Offsets
IPC = 10 instructions / 6 clocks = 1.67
Note:
● Modified offset for st
● Intermediate instruction between the load into %ebx and its use by add
Exercise 2: You can unroll the loop once (2 iterations) with register renaming.
Unrolling the loop with register renaming
.L1:
ld (%rdi),%eax ; load temp=*A
ld (%rsi),%ebx ; load *B
add %eax,%ebx ; add temp+*B
add $9,%ebx ; add 9
st %ebx,(%rdi) ; store *A
st %eax,(%rsi) ; store *B
ld -4(%rdi),%r8d ; 2nd iter
ld -4(%rsi),%r9d ;
add %r8d,%r9d ;
add $9,%r9d ;
st %r9d,-4(%rdi) ;
st %r8d,-4(%rsi) ;
add $-8,%rdi ; dec. A ptr.
add $-8,%rsi ; dec. B ptr.
add $-2,%edx
jne $0,%edx,.L1 ; loop
Loop Unrolling / Register Renaming

=== INTEGER SLOT ===
add $-8,%rdi
add $-8,%rsi
add $-2,%edx
add %eax,%ebx
add $9,%ebx
add %r8d,%r9d
add $9,%r9d
jne $0,%edx,.L1
Move Up and Modify Offsets
IPC = 16 instructions / 8 clocks = 2
Note: intermediate instructions between loads and uses of a register.
=== LD/ST SLOT ===
ld (%rdi),%eax
ld (%rsi),%ebx
ld 4(%rdi),%r8d
ld 4(%rsi),%r9d
st %eax,8(%rsi)
st %ebx,8(%rdi)
st %r8d,4(%rsi)
st %r9d,4(%rdi)
Note: the %rsi/%rdi store order is reversed and the offsets increased to account for the earlier pointer updates.
Exercise: Solve WAR/WAW hazards
Solve the WAR/WAW hazards in the following code through renaming.

Original:
ld 0(%rdi),%rax
add %rcx,%rax
sub %rbx,%rcx
ld 0(%rsi),%rbx
sub %rsi,%rbx
add %rbx,%rbx
Renamed:
ld 0(%rdi),%rax
add %rcx,%rax
sub %rbx,%rcx
ld 0(%rsi),%r8
sub %rsi,%r8
add %r8,%r8
12.34
STATIC SCHEDULING EXAMPLE
12.37
DYNAMIC MULTIPLE ISSUE MACHINES
HW-based solutions
12.38
Overcoming the Memory Latency
• What happens to instruction execution if we have a cache miss?
– All instructions behind it need to stall!
– It could take potentially hundreds of clock cycles to fetch the data
• Can we overcome this?

[Figure: 2-way pipeline where ld 0(%rdi),%rcx misses in the D-Cache and all subsequent instructions STALL behind it.]
12.39
Out-Of-Order Execution
• Idea: have the processor find dependencies as instructions are fetched/decoded and execute independent instructions that come after stalled instructions
– Known as Out-of-Order Execution or Dynamic Scheduling
– The HW will determine the "dependency" graph at runtime and, as long as an instruction isn't waiting for an earlier instruction, let it execute!
# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld 0(%rdx),%r8
    add $1,%r8
    st %r8,0(%rdx)
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    add $4,%rdi
    add $-1,%esi
    jne $0,%esi,L1

void f1(int *A, int n, int *s) {
  *s += 1;
  for( ; n != 0; n--, A++)
    *A += 5;
}

ld 0(%rdx),%r8    # CACHE MISS
add $1,%r8        # STALL: true (RAW) dependence on %r8
st %r8,0(%rdx)    # STALL: true (RAW) dependence on %r8
ld 0(%rdi),%r9    # independent of the miss
add $5,%r9        # independent of the miss
st %r9,0(%rdi)    # independent of the miss
12.40
OoO Execution: Tomasulo's Algorithm
[Block diagram adapted from Prof. Michel Dubois (simplified for CS356): I-Cache -> Instruction Queue -> Issue Unit -> per-unit queues (Integer/Branch, Load/Store, Div, Mult) -> functional units and D-Cache -> Common Data Bus -> Register File / Register Status Table.]

• Fetch multiple instructions per clock cycle in PROGRAM ORDER (i.e., the normal order generated by the compiler)
• Decode & dispatch multiple instructions per cycle, tracking dependencies on earlier instructions
• Instructions wait in queues until their respective functional unit (the hardware that will compute their value) is free AND they have their data available (from the instructions they depend upon)
• Results of multiple instructions can be written back per cycle; results are broadcast to any instruction waiting for that result
• Register Status Table: tracks which instruction is the latest producer of each register (helps track the dependencies)

# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld 0(%rdx),%r8
    add $1,%r8
    st %r8,0(%rdx)
L1: ld 0(%rdi),%r9
    add $5,%r9
    st %r9,0(%rdi)
    add $4,%rdi
    add $-1,%esi
    jne $0,%esi,L1
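The register status table's role can be sketched in a few lines of C (the data layout is my own simplification, not the actual Tomasulo hardware): for each register, remember the tag of the in-flight instruction that will produce it, so a new instruction's sources resolve to either a ready register value or a producer tag (a RAW dependence):

```c
/* Minimal register status table sketch. Registers are numbered
   0..NREGS-1; instructions are tagged by their program-order index.
   producer[r] holds the tag of the latest in-flight writer of r,
   or -1 if the register file already holds the value. */
#define NREGS 16

typedef struct { int src1, src2, dst; } Instr;   /* -1 = unused field */

/* For each instruction i, record in deps[i] the tag of the producer
   of each source (or -1 if ready), then mark i as the latest
   producer of its destination. */
void track_deps(const Instr *prog, int n, int deps[][2]) {
    int producer[NREGS];
    for (int r = 0; r < NREGS; r++) producer[r] = -1;
    for (int i = 0; i < n; i++) {
        deps[i][0] = prog[i].src1 >= 0 ? producer[prog[i].src1] : -1;
        deps[i][1] = prog[i].src2 >= 0 ? producer[prog[i].src2] : -1;
        if (prog[i].dst >= 0)
            producer[prog[i].dst] = i;   /* latest producer wins */
    }
}
```

Because only the latest producer is recorded, an instruction waits on the result tag rather than the architectural register, which is also what makes renaming away WAR/WAW conflicts possible in hardware.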
12.41
OoO Execution: Tomasulo's Algorithm (step 1)
Assume we can dispatch 3 instructions per cycle.
1: ld 0(%rdx),%r8     -> issued to the L/S queue; D-Cache MISS
2: addl $1,[res1]     -> waiting on the result of #1
3: st [res2],0(%rdx)  -> waiting on the result of #2
12.42
OoO Execution: Tomasulo's Algorithm (step 2)
Assume we can dispatch 3 instructions per cycle.
1: ld 0(%rdx),%r8     -> STALLED (miss outstanding)
2: addl $1,[res1]
3: st [res2],0(%rdx)
4: ld 0(%rdi),%r9
5: addl $5,[res4]
6: st [res5],0(%rdi)
12.43
OoO Execution: Tomasulo's Algorithm (step 3)
#4 ld 0(%rdi),%r9 has all its data and thus can jump ahead of the other (stalled) LD and the ST that depends on that LD. It executes out of order.
12.44
OoO Execution: Tomasulo's Algorithm (step 4)
Once the #4 LD reads from the cache (assume a cache hit), it broadcasts its value on the Common Data Bus; the dependent #5 ADD picks it up and can execute. Meanwhile, the next instructions (7: add $4,%rdi and 8: add $-1,%esi) can be dispatched. #1 remains STALLED.
12.45
OoO Execution: Tomasulo's Algorithm (step 5)
Once the #5 ADD computes its sum, it broadcasts its value; the dependent #6 ST picks it up and can execute. Meanwhile, the other ADDs (#7, #8) can execute and complete. #1 remains STALLED.
12.46
OoO Execution: Tomasulo's Algorithm (step 6)
When the cache finally resolves the miss, the #1 LD broadcasts its value; the dependent #2 ADD picks it up and can execute. From there, the #3 ST can execute next.
12.47
Dynamic Multiple Issue
• The burden of scheduling code for parallelism is placed on the HW and is performed as the program runs (not necessarily at compile time)
– The compiler can help by moving code, but the HW guarantees correct operation no matter what
• The goal is for the HW to determine data dependencies and let independent instructions execute even if previous instructions (dependent on something) are stalled
– We call this a form of Out-of-Order Execution
12.48
Problems with OoO Execution
• What if an exception (e.g., a page fault) occurs in an earlier instruction AFTER later instructions have already completed?
– The OS will save the state of the program and handle the page fault
– When the OS resumes, it will restart the process at the ST instruction
– The subsequent instructions will then execute for a 2nd time. BAD!!!
• We need to fetch and dispatch multiple instructions per cycle, but when we hit a jump/branch we don't know which way to fetch!
• Solution: speculative execution with the ability to "roll back"

# %rdi = A, %esi = n, %rdx = s
f1: ld 0(%rdx),%r8
    add $1,%r8
    st %r8,0(%rdx)    # Page fault! Resume here after stalling
L1: ld 0(%rdi),%r9    # completed
    add $5,%r9        # completed
    st %r9,0(%rdi)    # completed
    add $4,%rdi
    add $-1,%esi
    jne %r8,%esi,L1   # Where next?
    next instruc.
12.49
SPECULATIVE EXECUTION
12.50
Speculation w/ Dynamic Scheduling
• Basic block size of 5-7 instructions between branches limits our ability to issue/execute instructions
– For safety, we might consider stalling (stop dispatching instructions) until we know the outcome of a jump/branch
• But speculation allows us to predict a branch outcome and continue issuing and executing down that path
• Of course, we could be wrong, so we need the ability to roll back the instructions we should not have executed if we mispredict
• We add a structure known as the commit unit (or re-order buffer / ROB)

ld 0(%r8),%r9       # cache miss
and %r10,%r11
L1: add %r8,%r12
or %r11,%r13
sub %r14,%r10
jeq %r12,%r14,L1    # STALL until we know the outcome?
xor %r10,%r15
12.51
Out-Of-Order Diagram
[Figure: out-of-order pipeline. Front end: in-order issue logic (> 1 instruction per clock) feeding queues for the functional units (INT ALU, INT MUL/DIV, FP ADD, FP MUL/DIV, Load/Store). Middle: out-of-order execution, with results forwarded to dependent (RAW) instructions. Back end: a Re-Order Buffer (ROB) that temporarily buffers results until earlier instructions complete, and a commit unit that performs writeback (to D$ or registers) IN ORDER, from instruction i (oldest) to instruction i+n (newest).]
12.52
Commit Unit (ROB)
• A ROB tail entry is allocated for each instruction at issue (ensuring instruction entries are maintained in program order)
• When an instruction completes, its result is forwarded to others but is also stored in the ROB (rather than being written back to the register file) until it reaches the head of the ROB
• The commit unit only commits the instructions at the head of the queue, and only when they are fully completed
– Ensures that everything committed was correct
– If we hit an exception or misspeculate/mispredict a branch, then throw away everything behind it in the ROB and start fresh using the correct outcome

[Figure: ROB contents from head to tail: LD, AND, ADD, SUB (executed, waiting to write back); ST, JEQ (stalled or not yet executed); XOR / ADD ???? (speculative).]

ld 0(%r8),%r9
and %r9,%r11
L1: add %r8,%r12
sub %r8,%r10
st %r9,0(%r13)
jeq %r9,%r14,L1
xor %r10,%r15
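The head/tail behavior described above can be sketched as a small ring buffer (a deliberate simplification with no results, exceptions, or flush logic; the names are my own):

```c
/* Minimal in-order-commit ROB sketch: entries are allocated at the
   tail in program order; completion may happen in any order, but
   commit only pops finished entries from the head. */
#define ROB_SIZE 8

typedef struct {
    int completed[ROB_SIZE];
    int head, tail, count;
} Rob;

/* Allocate a tail entry at issue; returns its index, or -1 if full. */
int rob_issue(Rob *r) {
    if (r->count == ROB_SIZE) return -1;
    int idx = r->tail;
    r->completed[idx] = 0;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return idx;
}

/* Mark an entry as executed (can happen out of order). */
void rob_complete(Rob *r, int idx) { r->completed[idx] = 1; }

/* Commit as many completed entries as sit at the head; returns how
   many committed. A stalled head blocks everything behind it. */
int rob_commit(Rob *r) {
    int n = 0;
    while (r->count > 0 && r->completed[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}
```

A flush on an exception or misprediction would simply reset the tail back toward the offending entry, discarding the uncommitted results behind it.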
12.53
Commit Unit (ROB)
• What happens if the ST instruction that is STALLED ends up causing a page fault?
– The ROB allows us to throw away the instructions after it and replay them after the page fault is handled
• When we get to the JEQ, we don't know %r9, so we'll just guess (predict) the outcome and fetch down that predicted path
– If we mispredict, the ROB allows us to throw away the instruction results

[Figure: same ROB as before; the ST near the head raises an exception and the JEQ is mispredicted, so the entries behind them are flushed.]

ld 0(%r8),%r9
and %r9,%r11
L1: add %r8,%r12
sub %r8,%r10
st %r9,0(%r13)     # Exception!
jeq %r9,%r14,L1    # Mispredicted
xor %r10,%r15
12.54
Branch Prediction + Speculation
• To keep the back end fed with enough instructions, we need to predict each branch's outcome and perform "speculative" execution beyond the predicted (unresolved) branch
– Roll-back mechanism (flush) in case of misprediction

[Figure: a chain of basic blocks separated by conditional branches; starting from the head of the ROB, the speculative execution path follows the predicted NT/T outcome at each branch.]
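A common concrete mechanism behind these predictions is a 2-bit saturating counter per branch; a minimal sketch (one counter, no branch history table):

```c
/* 2-bit saturating-counter branch predictor. States 0-1 predict
   not-taken, states 2-3 predict taken. The counter moves one step
   toward each actual outcome, so a single anomalous iteration in a
   long run of taken branches does not flip the prediction. */
typedef struct { int counter; } Predictor;   /* value in 0..3 */

int predict(const Predictor *p) { return p->counter >= 2; }  /* 1 = taken */

void update(Predictor *p, int taken) {
    if (taken  && p->counter < 3) p->counter++;
    if (!taken && p->counter > 0) p->counter--;
}
```

Real front ends index a table of such counters by branch address (often combined with history bits), but the saturating behavior is the core idea.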
12.55
Speculation Example
• Predict branches and execute the most likely path
– Simply flush the ROB entries after the mispredicted branch
– Need good prediction capabilities to make this useful

[Figure: Time 1: ROB with predicted branches (red entries); assume the head is stalled, so the speculative path spans several basic blocks. Time 2a: one branch turns out mispredicted (black entry); everything beyond it is wrong-path execution. Time 2b: flush the ROB/pipeline of the instructions behind it. Time 3: the ROB/pipeline begins to fill with the correct path.]
CS356: Meltdown & Spectre
Or... The Dangers of Caches and Speculation
Marco Paolieri (paolieri@usc.edu)
Illustrations from meltdownattack.com
Virtual memory isolates processes, right?
Programs are not permitted to read data from other programs… and yet:
● Meltdown allows a program to access the memory (and secrets) of other programs and the operating system. It works in the cloud too!
● Spectre allows an attacker to trick error-free programs, which follow best practices, into leaking their secrets. Safety checks of said best practices actually increase the attack surface and may make applications more susceptible to Spectre. Harder to exploit but also harder to mitigate.
Wait, but:
● Am I affected by the vulnerability? Most certainly, yes.
● Where's the problem? Out-of-order execution (Intel since 1995, some ARM)
● Can I detect if someone has exploited them against me? Probably not.
● Can my antivirus detect attackers? Unlikely until a binary becomes known.
● What can be leaked? Anything in your RAM.
● Is there a fix? OS patches against Meltdown, compiler patches for Spectre.
In practice: reading other programs’ memory
Meltdown: Melts HW Security Boundaries
Isolation between kernel and user processes
● The supervisor bit in the CPU is set when entering kernel code.
● Kernel memory pages can be read only when the supervisor bit is set.
● No change of memory mapping between kernel mode and user mode.
● Meltdown uses side-channel information leaked by out-of-order execution to read kernel virtual memory.
● In the kernel virtual memory, there is a region that gives access to all of the physical RAM.
● It’s not a bug of the OS, but a problem of modern CPUs. All operating systems are affected.
So, what’s the problem with out-of-order?
Out-of-Order and Speculative Execution
Out of Order. While the execution unit (ALU, LD, ST) of the current instruction is busy, other instructions can run ahead if resources and data are available.
Speculative Execution. Let the CPU process instructions even when it’s not certain that they will be needed. If they are not needed, just undo their effects (rollback); otherwise commit.● Example: instructions after a branch
How? Tomasulo's algorithm (1967)
● Queue instructions in the scheduler
● Connect all units with the CDB (common data bus)
● Run them when data is available
● Roll back if the order was wrong
Cache Side-Channel Attacks: Flush+Reload
Flush+Reload
1. The attacker frequently flushes the shared last-level cache (clflush)
2. By timing a read from an address, the attacker determines whether data within the same line was read by the victim
Variant: Evict+Reload. Instead of clflush, force eviction through many reads.
Covert Channel
The attacker flushes the last-level cache and tries to access an address. Why? To let some information leak from one security domain to another.
An example: guessing the value of data
● After the exception, the flow continues in the kernel.● The exception handler terminates the program.● But due to out-of-order execution, the array access is performed…
… and rolled back later when the exception happens. But the effects on the cache are still present… a cache line was read.
Guessing values of kernel data
Transient Instructions
Completed but not committed (they introduce a side channel)
● Accessing kernel data from user-space causes an exception.
● If we have instructions that use that data, they will certainly be rolled back.
● Can we leave a trace of the data in the state of the microarchitecture? Like, in the cache?
Avoiding the exception
The exception will terminate your attack program. Solutions:
● Do the illegal access from a child process, recover secret data from the parent
● Trick the branch predictor into thinking that the illegal code should run, when it won't: this schedules your code, without an exception.
Sending data through the covert channel
Very simple!
● Fix a monitored memory address x
● The transient instruction reads kernel data and transmits it 1 bit at a time
  ○ Flush the cache; if the bit is 1, read from x
● The recovery code (in user space) reads from x
  ○ If the read is fast, the value was 1
Using many addresses in different cache lines, we can transmit multiple bits at a time.
Meltdown can transmit data at 500 kB/s. Not bad!
KASLR
Kernel Address Space Layout Randomization
Same as ASLR, but instead of randomizing the stack position, randomize the position of the kernel virtual memory.
Enabled by default in Linux 4.12
The location of the physical memory range is also randomized…
● but only by 40 bits
● it takes 128 tests to cover 8 GB
So, it can be circumvented.
Linux KAISER Patch
Stronger isolation between user space and kernel space
Don't map kernel data in user-space virtual memory (except for a few things like interrupt handlers)

Now KASLR is useful
There is little kernel data mapped, and it's hard to guess where it is
Meltdown doesn't work on AMD
Probably because access bits are checked in the page table before LD/ST
Out of Order Execution + Side Channels
4-Step Recipe for a Meltdown
1. Vulnerable out-of-order CPU allows an unprivileged process to load data from a privileged address (kernel and all physical RAM) into a register.
2. The CPU performs computations based on the value of this register (e.g., use the register as array index to read from memory).
3. If it turns out that we didn’t have the right to read the data, the results of these operations are thrown away.
4. Seems safe but… out-of-order memory lookups influence cache hit/miss: from this side channel, we can find out what the value was in the register.
Read the original paper!
https://arxiv.org/abs/1801.01207
Spectre: The Risks of Speculative Execution
1. Locate or introduce a sequence of instructions within the process address space which, when executed, act as a covert channel transmitter leaking the victim’s memory or register contents.
2. Trick the CPU into speculatively and erroneously executing the instructions
3. Retrieve information from the covert channel
How to trick the CPU into wrong speculations?
● Mistrain the conditional branch predictor. For example: if there's an if guard, invoke it many times with valid data. Then pass invalid data… the branch predictor will take the "is valid" branch.
● Mistrain the indirect branch predictor. For example: the attacker jumps many times to the virtual address of a gadget in the attacked program. The attacked program will jump too...
Spectre with branch misprediction
Idea
● Suppose that we can choose x
● First use many valid x values
● Then, pick x out of bounds to read secret data of the attacked program
● If array1_size, array2 are not cached: the predictor will make a wrong guess
● Speculative execution causes a read on a page number equal to the secret
● We can scan all the pages and check what the secret was from access times
It works in the browser too!
● Cannot use clflush for Flush+Reload; must use Evict+Reload
● The clflush avoids optimizations, |0 triggers optimizations
● The secret n will be used to access probeTable[n*4096]
● Timing errors are possible due to other processes, but the error rate is 0.005%
Processors at risk
Most
● Intel
  ○ Ivy Bridge
  ○ Haswell
  ○ Broadwell
  ○ Skylake
  ○ Kaby Lake
● AMD Ryzen
● ARM-based Samsung and Qualcomm processors
(And it works with native code or even sandboxed Javascript.)
Difference with respect to Meltdown
● Meltdown does not mistrain branch prediction
● Meltdown bypasses memory protection between user space and kernel
● Spectre does not: it just makes the CPU jump to the wrong place and run instructions there to leak information. Cannot be fixed with KAISER.
Countermeasures
Preventing speculative execution
Very expensive!
Preventing access to secret data
● Run websites in separate processes
● Replace array bounds checking with index masking to ensure locality
● Pointer poisoning: (pointer XOR secret)
  Only with the secret can you recover/follow the pointer, and branch predictions use different statistics.
Preventing data from being leaked through caches
Not possible with current processors
Read the original paper!
https://arxiv.org/abs/1801.01203
12.73
BONUS MATERIAL
(Not responsible for this material)
12.74
BRANCH PREDICTION
12.75
Branch Prediction
• Since basic blocks are often small, multiple issue (static or dynamic) processors may encounter a branch every 1-2 cycles
• We not only need to know the outcome of the branch but also its target
– Branch target: branch target buffer (cache)
– Branch outcome: Static (compile-time) or dynamic (run-time / HW assisted) prediction techniques
• To keep the pipeline full and make speculation efficient and not wasteful, the processor needs accurate predictions
12.76
Branch Target Availability

• Branches perform PC = PC + displacement, where the displacement is stored as part of the instruction
• Usually we can't get the target until after the instruction is completely fetched (the displacement is part of the instruction)
– May be 2-3 cycles in a deeply pipelined processor (ex. [I-TLB, I-Cache Look-up, I-Cache Access, Decode])
– With a 4-way superscalar and a 3-cycle branch penalty, we throw away 12 instructions on a misprediction
• Key observation: Branches always branch to same place (target is constant for each branch)
[Figure: a deep front-end pipeline (PC → I-TLB / I-$ look-up → I-$ access → decode, separated by pipe registers). The branch instruction is still being fetched in the early stages, becomes available only after decode, and its target (PC+d) only after that. We would like to have the branch target available in the 1st stage.]
12.77
Finding the Branch Target
• Key observation: jumps/branches always branch to the same place (the target is constant for each branch)
– The first time we fetch a jump we'll have no idea where it is going to jump to and thus have to wait several cycles
– But let's save the address where the jump instruction lives AND where it wants to jump to (i.e. the target)
– The next time the PC gets to the starting address of the jump we can look up the target quickly
– Keep all this info in a small "branch target cache/buffer"
00000000004004d6 <sum>:
  4004d6: 85 f6                 test   %esi,%esi
  4004d8: 7e 1d                 jle    4004f7 <sum+0x21>
  4004da: 48 89 fa              mov    %rdi,%rdx
  4004dd: 8d 46 ff              lea    -0x1(%rsi),%eax
  4004e0: 48 8d 4c 87 04        lea    0x4(%rdi,%rax,4),%rcx
  4004e5: b8 00 00 00 00        mov    $0x0,%eax
  4004ea: 03 02                 add    (%rdx),%eax
  4004ec: 48 83 c2 04           add    $0x4,%rdx
  4004f0: 48 39 ca              cmp    %rcx,%rdx
  4004f3: 75 f5                 jne    4004ea <sum+0x14>
  4004f5: eb 05                 jmp    4004fc <sum+0x26>
  4004f7: b8 00 00 00 00        mov    $0x0,%eax
  4004fc: c6 05 36 0b 20 00 01  movb   $0x1,0x200b36(%rip)
  400503: c3                    retq

Jump Instruc. Address | Jump Target
0x4004d8              | 0x4004f7
0x4004f3              | 0x4004ea
0x4004f5              | 0x4004fc
12.78
Branch Target Buffer
• Idea: Keep a cache (branch target buffer / BTB) of branch targets that can be accessed using the PC in the 1st stage
– The cache holds target addresses and is accessed using the PC (the address of the branch instruction)
– The first time a branch is executed, the cache will miss and we'll take the branch penalty, but we save its target address in the cache
– Subsequent accesses will hit in the BTB (until evicted) and we can use that target if we predict the branch is taken
• Note: The BTB is a "fully-associative" cache (search all entries for a PC match)…thus it can't be very large

[Figure: the PC looks up the BTB in parallel with the I-TLB access; on a hit, the BTB supplies the target PC in the 1st stage]
12.79
Branch Outcome Prediction
• Now that we have predicted the target, we need to predict the outcome
• Static prediction
– Have the compiler make a fixed guess and put that as a "hint" in the instruction itself
– Effective for loops
• Dynamic prediction
– Some jumps are data dependent (e.g. if(x < y))
– Keep some "history"/records of each branch's outcomes from the past & use that to predict the future
– Store that history in a cache
– Questions:
  • What history should we use to predict a branch?
  • How much history should we use/keep to predict a branch?
[Figure: loop control flow (loop body, then a loop check that is taken back to the body and not-taken to the "after" code); in the 1st stage the PC indexes both the BTB (target PC = PC+d) and the predictors (outcome T/NT)]
12.80
Local vs. Global History
• What history should we look at?
– Should we look at just the previous executions of the particular branch we're currently predicting, or at surrounding branches as well?
• Local History: the previous outcomes of that branch only
– Usually good for loop conditions
• Global History: the previous outcomes of the last m branches in time (other previous branches)
do {
  i--;
  if(x == 2) { … }
  if(y == 2) { … }
  if(x == y) { … }  // Better: Local or Global?
} while (i > 0);    // Better: Local or Global?
12.81
Global (Correlating) Predictor
• Use the outcomes of the last m branches that were executed to select a prediction
– Given the last m jumps, there are 2^m possible combinations of outcomes & thus predictions
  • When jeq1=NT and jeq2=NT, predict jne = T; when jeq1=NT and jeq2=T, predict jne = NT, etc.
– The branch predictor is indexed by concatenating the LSBs of the PC and the m bits of the last m branch outcomes

je  ...,L1   (T = 1)
...
je  ...,L2   (NT = 0)
...
jne ...,L3   // use history of NT,T to predict

[Figure: the PC's LSBs are concatenated with the m-bit global history register to index the predictor table, in parallel with the I-TLB access]
12.82
Tournament Predictor
• Dynamically selects when to use the global vs. local predictor
– The accuracy of the global vs. local predictor may vary for different branches
– The tournament predictor keeps the history of both predictors (global and local) for a branch and then selects the one that is currently the most accurate

[Figure: a tournament selector chooses between the local prediction and the global prediction, picking the predictor exhibiting the greatest accuracy]
12.83
Dynamic Scheduling Summary
• You can understand a modern architecture
– https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)
• Software implications
– Code with a lot of branches will perform worse than regular code
– Many cache misses will limit performance
• But compared to a statically scheduled processor…
12.84
Pros/Cons of Static vs. Dynamic
• Static
– HW can be simpler since the compiler schedules code
– The compiler can see "deeper" into the code to find parallelism
– Used in many high-performance embedded processors like GPUs, etc., where code is more regular with high computation demand
• Dynamic
– Allows for performance increases even for legacy software
– Can be better at predicting unpredictable branches
– HW structures (ROB, reservation stations, etc.) do not scale well beyond small sizes and waste more time & power
– Better for unpredictable, general-purpose control code
12.85
PHYSICAL VS ARCHITECTURAL REGISTERS
12.86
Virtual Registers?
• In static scheduling, the compiler accomplished register renaming by changing the instruction to use other programmer-visible GPR’s
• In dynamic scheduling, the HW can "rename" registers on the fly
– In the trace below we would want %r9 to be renamed (e.g. to %r10, %r11, …) for each iteration
• Solution: a level of indirection
– Let the register numbers be "virtual" and then perform translation to a "physical" register
– Every time we write to the same register we are creating a "new version" of the register…so let's just allocate a physically different register

# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld   0(%rdx),%r8
    addl $1,%r8
    st   %r8,0(%rdx)
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1

Trace of instructions over 3 loop iterations. Each iteration is independent if we can rename %r9.
12.87
Register Renaming
• Whenever an instruction produces a new value for a register, allocate a new physical register and update the table
– Mark the old physical register as "free"
– Mark the newly allocated register as "used"
• An instruction that wants to read a register just uses whatever physical register the current mapping table indicates
[Figure: three snapshots of the mapping table (%rdi, %rsi, %r9, …) and the physical register pool (%r0-%r7, each marked used/free) while renaming the trace above. 1: after "ld 0(%rdi),%r9", %r9 maps to a newly allocated physical register (the ld's r9). 2: after "add $5,%r9", %r9 is remapped to another free physical register (the add's r9) and the previous one is freed. 3: after "add $4,%rdi" and "add $-1,%esi", %rdi and %esi are remapped in the same way, ready for the next iteration's "ld 0(%rdi),%r9".]
12.88
Architectural vs. Physical Registers
• Architectural registers = The (16) x86 registers visible to the programmer or compiler
– Truly just names ("virtual")
– The mapping table needs 1 entry per architectural register
• Physical registers = the actual hardware registers, more numerous than the architectural registers, used as a "pool" for renaming
• Often a large pool of physical registers (80-128) to support a large number of instructions executing at once or waiting in the commit unit
[Figure: the same three renaming snapshots as above, with the mapping table now labeled "Architectural Registers": each architectural name (%rdi, %rsi, %r9, …) points to its current physical register in the pool.]