
CS356 Unit 12b

Advanced Processor Organization

• Understand the terms and ideas used in a modern, high-performance processor

• Various systems have different kinds of processors and you should understand the pros and cons of each kind of processor

• Terms to listen for and understand the concept:
– Superscalar/multiple issue, loop unrolling, register renaming, out-of-order execution, speculation, and branch prediction

A New Instruction

• In x86, we often perform
– cmp %rax, %rbx
– je L1 or jne L1

• Many instruction sets have a single instruction that both compares and jumps (limited to registers only)
– je %rax, %rbx, L1
– jne %rax, %rbx, L1

• Let us assume x86 supports such an instruction in our subsequent discussion

INSTRUCTION LEVEL PARALLELISM

Have We Hit The Limit

• Under ideal circumstances, a pipeline would allow us to achieve a throughput (IPC = Instructions per clock) of 1

• Can we do better? Can we execute more than one instruction per clock?
– Not with a single pipeline
– But what if we had multiple "pipelines"?
– What if we fetched multiple instructions per clock and let them run down the pipeline in parallel?

• Let's exploit parallelism!

Exploiting Parallelism

• With the increasing transistor budgets of modern processors (i.e., they can do more things at the same time), the question becomes how to find enough useful tasks to increase performance or, put another way, what are the most effective ways of exploiting parallelism?

• Many types of parallelism available
– Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
– Thread Level Parallelism (TLP): Overlap execution of multiple processes/threads
– Data Level Parallelism (DLP): Overlap an operation (instruction) that is to be applied independently to multiple data values (usually, an array)

for(i=0; i < MAX; i++) { A[i] = A[i] + 5; }

• We'll focus on ILP in this unit

ld 0(%r8), %r9
and %r10, %r11
or %r11, %r13
sub %r14, %r15
add %r10, %r12
je $0,%r12,L1
xor %r15, %rax

Instruction Level Parallelism (ILP)

• Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.

• ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel

• Data flow (data dependencies) is what truly limits ordering– We call these dependencies RAW (Read-After-Write) Hazards

• Independent instructions can be parallelized

• Control hazards also provide ordering constraints

[Figure: the seven instructions (ld, and, or, sub, add, je, xor) arranged by the cycle in which they can execute, over Cycles 1 to 3. Cycle 1 holds LD, ADD, SUB and AND; OR, JE and XOR follow in later cycles. Dependency arrows: AND writes %r11 and OR reads %r11; SUB writes %r15 and XOR reads %r15; ADD writes %r12 and JE reads %r12.]
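The idea that data flow, not program order, decides when an instruction can run can be sketched in C. The sketch below computes the earliest cycle for each of the seven instructions of the example, assuming unlimited issue width and a 1-cycle latency. The enum names, the `deps` table and the function `earliest_cycles` are hypothetical; the three RAW edges are the ones annotated on the slide, and the extra control edge from `je` to `xor` is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* The seven instructions in program order (hypothetical names). */
enum { I_LD, I_AND, I_OR, I_SUB, I_ADD, I_JE, I_XOR, N_INSTR };

/* One edge: instruction `to` must wait for instruction `from`. */
struct dep { int from, to; };

/* RAW edges from the slide annotations, plus one assumed control edge. */
static const struct dep deps[] = {
    { I_AND, I_OR  },  /* and writes %r11, or reads %r11  */
    { I_SUB, I_XOR },  /* sub writes %r15, xor reads %r15 */
    { I_ADD, I_JE  },  /* add writes %r12, je reads %r12  */
    { I_JE,  I_XOR },  /* assumption: xor waits for the branch */
};

/* Fill cycle[i] with the earliest 1-based cycle in which instruction i can
 * execute. Every edge points from an earlier to a later instruction, so one
 * pass in program order is a valid topological traversal. */
void earliest_cycles(int cycle[N_INSTR])
{
    size_t n_deps = sizeof deps / sizeof deps[0];
    for (int i = 0; i < N_INSTR; i++) {
        cycle[i] = 1;
        for (size_t d = 0; d < n_deps; d++) {
            assert(deps[d].from < deps[d].to);  /* producers come first */
            if (deps[d].to == i && cycle[deps[d].from] + 1 > cycle[i])
                cycle[i] = cycle[deps[d].from] + 1;
        }
    }
}
```

Under these assumptions the four independent instructions land in cycle 1, `or` and `je` in cycle 2, and `xor` in cycle 3, so seven instructions fit in three cycles instead of seven.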

Basic Blocks

• Basic Block (def.) = Sequence of instructions that will always be executed together
– No conditional branches out
– No branch targets coming in
– Also called "straight-line" code
– Average size: 5-7 instrucs.

• Instructions in a basic block can be overlapped if there are no data dependencies

• Control dependencies really limit our window of possible instructions to overlap
– Without extra hardware, we can only overlap execution of instructions within a basic block

The instructions from L1 through jeq form a basic block (starts w/ target, ends with branch):

ld 0(%r8),%r9
and %r10,%r11
L1: add %r8,%r12
or %r11,%r13
sub %r14,%r10
jeq %r12,%r14,L1
xor %r10,%r15

Superscalar

• When airplanes broke the sound barrier we said they were super-sonic

• When a processor (HW) can complete more than 1 instruction per clock cycle we say it is super-scalar

• Problem: The HW can execute 2 or more instructions during the same cycle but the SW may be written and compiled assuming 1 instruction executing at a time.

• Solutions
– Static Scheduling: Recompile the code and rely on the compiler to safely order instructions that can be run in parallel
– Dynamic Scheduling: Build the HW to be smart, reorder instructions on the fly while guaranteeing correctness


Superscalar (Multiple Issue)

• Multiple "pipelines" that can fetch, decode, and potentially execute more than 1 instruction per clock
– k-way superscalar: ability to complete up to k instructions per clock cycle

• Benefits
– Theoretical throughput greater than 1 (IPC > 1)

• Problems
– Hazards
  • Dependencies between instructions limiting parallelism
  • Branch/jump requires flushing all pipelines
– Finding enough parallel instructions

Data Flow and Dependency Graphs

• The compiler produces a sequential order of instructions

• Modern processors will transform the sequential order to execute instructions in parallel

• Instructions can be executed in any valid topological ordering of the dependency graph

ld 0(%r8), %r9
and %r9, %r11
or %r11, %r13
sub %r14, %r15
add %r10, %r12
je $0,%r12,L1
xor %r15, %r9

[Figure: dependency graph of the instructions above, with nodes such as ADD, SUB and AND.]

STATIC MULTIPLE ISSUE MACHINES

Compiler-based solutions

Static Multiple Issue

• Compiler is responsible for finding and packaging instructions that can execute in parallel into issue packets
– Only certain combinations of instructions can be in a packet together
– Instruction packet example with 3 instructions/packet:
  • (1) Integer/Branch instruction slot
  • (1) LD/ST instruction
  • (1) FP operation

• An issue packet is often thought of as a LONG instruction containing multiple instructions (a.k.a. Very Long Instruction Word)
– Intel’s Itanium used this technique (static multiple issue) but called it EPIC (Explicitly Parallel Instruction Computing)

Example 2-way VLIW machine

• One issue slot for INT/BRANCH operations & another for LD/ST instructions

• I-Cache reads out an entire issue packet (more than 1 instruction)

• HW is added to allow many registers to be accessed at one time
– Just more multiplexers

• Address Calculation Unit (just a simple adder)

[Figure: 2-way VLIW datapath with I-Cache, Reg. File (4 Read, 2 Write), an ALU for the INT/BRANCH slot, and Addr. Calc. plus D-Cache for the LD/ST slot. Issue Packet = more than 1 instruction, e.g. add %rcx,%rax (INT/BRANCH) together with ld 8(%rdi),%rdx (LD/ST).]

2-way VLIW Scheduling

1. No forwarding w/in an issue packet (between instructions in a packet)

2. Full forwarding to previous instructions
– Those behind in the pipeline

3. Still 1 stall cycle necessary when LD is followed by a dependent instr.

[Figure: the 2-way VLIW pipeline (I-Cache, Reg. File, AddrCalc, D-Cache) holding several VLIW issue packets built from these instructions: add %rcx,%rax; st %rax,0(%rdi); sub %rax,%rbx; ld 0(%rdi),%rcx; or %rcx,%rdx; ld 0(%rdi),%rcx.]


Sample Scheduling

• Schedule the following loop body on our 2-way static issue machine

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
add $4,%rdi
add $-1,%esi
jne $0,%esi,L1

void f1(int *A, int n) { for( ; n != 0; n--, A++) *A += 5; }

w/o modifying original code but with code movement (IPC = 6 instrucs. / 5 cycles = 1.2):

Cycle | Int./Branch Slot | LD/ST Slot
1     |                  | ld 0(%rdi),%r9
2     | add $-1,%esi     |
3     | add $5,%r9       |
4     | add $4,%rdi      | st %r9,0(%rdi)
5     | jne $0,%esi,L1   |

w/ modifications and code movement (IPC = 6 instrucs. / 4 cycles = 1.5):

Cycle | Int./Branch Slot | LD/ST Slot
1     | add $-1,%esi     | ld 0(%rdi),%r9
2     | add $4,%rdi      |
3     | add $5,%r9       |
4     | jne $0,%esi,L1   | st %r9,-4(%rdi)

Annotated Example

[Figure: annotated 2-way VLIW datapath (I-Cache, Reg. File with 4 Read and 2 Write ports, ALU for INT/BRANCH, AddrCalc, D-Cache) showing the issue packets of the loop in flight: ld 0(%rdi),%r9; add $-1,%esi; add $5,%r9; add $4,%rdi with st %r9,0(%rdi); jne $0,%esi,L1; and a nop/nop packet.]

Loop Unrolling

• Often not enough ILP w/in a single iteration (body) of a loop

• However, different iterations of the loop are often independent and can thus be run in parallel

• This parallelism can be exposed in static issue machines via loop unrolling
– Copy the body of the loop k times and iterate only n / k times
– Instructions from different body iterations can be run in parallel

// Loop unrolled 4 times
void f1(int *A, int n) {
  // assume n is a multiple of 4
  for( ; n != 0; n-=4, A+=4) {
    *A += 5;
    *(A+1) += 5;
    *(A+2) += 5;
    *(A+3) += 5;
  }
}
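The slide's version assumes n is a multiple of 4. A minimal sketch of how that assumption is commonly lifted is shown here: an unrolled main loop followed by a short cleanup loop for the leftover elements. The function name `add5_unrolled` and the use of `size_t` are choices of this sketch, not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>

/* Add 5 to each of the n ints starting at A, unrolled by 4. */
void add5_unrolled(int *A, size_t n)
{
    assert(A != NULL || n == 0);
    size_t i = 0;

    /* Main loop: 4 independent bodies per iteration. */
    for (; i + 4 <= n; i += 4) {
        A[i]     += 5;
        A[i + 1] += 5;
        A[i + 2] += 5;
        A[i + 3] += 5;
    }

    /* Cleanup loop: the 0 to 3 elements left over when n % 4 != 0. */
    for (; i < n; i++)
        A[i] += 5;
}
```

For example, with n = 6 the main loop runs once (elements 0 to 3) and the cleanup loop handles elements 4 and 5.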

Loop Unrolling

Original Code

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
add $4,%rdi
add $-1,%esi
jne $0,%esi,L1

// Loop unrolled 4 times
for(i=0; i < MAX; i+=4){
  A[i] = A[i] + 5;
  A[i+1] = A[i+1] + 5;
  A[i+2] = A[i+2] + 5;
  A[i+3] = A[i+3] + 5;
}

Unrolled Code

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
ld 4(%rdi),%r9
add $5,%r9
st %r9,4(%rdi)
ld 8(%rdi),%r9
add $5,%r9
st %r9,8(%rdi)
ld 12(%rdi),%r9
add $5,%r9
st %r9,12(%rdi)
add $16,%rdi
add $-4,%esi
jne $0,%esi,L1

A side effect of unrolling is the reduction of overhead instructions (fewer branches and counter/ptr. updates).

Code Movement & Data Hazards

• To effectively schedule the code, the compiler will often move code up or down but must take care not to change the intended program behavior

• Must deal with WAR (Write-After-Read) and WAW (Write-After-Write) hazards in addition to RAW hazards when moving code
– WAW and WAR hazards are not TRUE hazards (no data communication between instrucs.) but simply conflicts because we want to use the same register. We call them name dependencies (WAR is an antidependence, WAW is an output dependence)
– How can we solve them? Register renaming.

Example 1: The LD instruction is REALLY independent (only needs %r11). Could the LD instruction be moved between the 2 add’s? Not as is: WAR hazard.

Original:
L1: add %r8, %r9
add %r9, %r10      # read %r9
ld 0(%r11),%r9     # write %r9

Proposed (wrong %r9 used by the second add):
L1: add %r8, %r9
ld 0(%r11), %r9
add %r9, %r10

Example 2: Could the LD instruction be run in parallel with or before the first add? Not as is: WAW hazard.

Original:
L1: add %r8, %r9   # write %r9
add %r9, %r10
ld 0(%r11), %r9    # write %r9
sub %r9, %r12

Proposed (wrong %r9 used by the sub):
L1: ld 0(%r11), %r9
add %r8, %r9
add %r9, %r10
sub %r9, %r12

Register Renaming

• Unrolling is not enough because even though each iteration is independent there are conflicts in the use of registers (%r9 in this case)
– Can’t move another 'ld' instruction up until 'st' is complete due to a WAR hazard even though there is not a true data dependence

• Since there is no true dependence (ld does not need data from 'st' or 'add' above) we can solve the problem by register renaming

• Register Renaming: Using different registers to solve WAR / WAW hazards

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)      # read %r9
ld 4(%rdi),%r9      # write %r9: can this ld move up?
add $5,%r9
st %r9,4(%rdi)
ld 8(%rdi),%r9
add $5,%r9
st %r9,8(%rdi)
ld 12(%rdi),%r9
add $5,%r9
st %r9,12(%rdi)
add $16,%rdi
add $-4,%esi
jne $0,%esi,L1
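Renaming can be sketched as a small pass over the instruction stream. The version below is a simplification in the style of what hardware does: every write receives a fresh register, and every read is redirected to the latest producer. It is not the per-iteration renaming a compiler would choose on the next slide. The `struct instr` layout, the register numbering and the function `rename_regs` are all hypothetical, and each instruction is modeled with at most one source and one destination register.

```c
#include <assert.h>

#define N_ARCH 16      /* architectural registers (hypothetical numbering) */
#define NO_REG (-1)

struct instr {
    int src;   /* register read, or NO_REG    */
    int dst;   /* register written, or NO_REG */
};

/* Rename in program order. Sources are looked up in the current map, which
 * preserves RAW dependences. Each destination gets a fresh register, which
 * removes WAR and WAW conflicts on the old name. Returns the next unused
 * register number. */
int rename_regs(struct instr *code, int n, int first_free)
{
    int map[N_ARCH];
    for (int r = 0; r < N_ARCH; r++)
        map[r] = r;                    /* initially each name maps to itself */

    int next = first_free;
    for (int i = 0; i < n; i++) {
        assert(code[i].src < N_ARCH && code[i].dst < N_ARCH);
        if (code[i].src != NO_REG)
            code[i].src = map[code[i].src];
        if (code[i].dst != NO_REG) {
            map[code[i].dst] = next;   /* later readers see the new name */
            code[i].dst = next++;
        }
    }
    return next;
}
```

Applied to two unrolled iterations (ld, add, st, ld, add, st, all on %r9), the second ld ends up writing a different register than the one the first st reads, so the WAR conflict that blocked moving the ld up disappears while each add still reads what its own ld produced.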

Scheduling w/ Unrolling & Renaming

• Schedule the following loop body on our 2-way static issue machine

# %rdi = A
# %esi = n = # of iterations
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
ld 4(%rdi),%r10
add $5,%r10
st %r10,4(%rdi)
ld 8(%rdi),%r11
add $5,%r11
st %r11,8(%rdi)
ld 12(%rdi),%r12
add $5,%r12
st %r12,12(%rdi)
add $16,%rdi
add $-4,%esi
jne $0,%esi,L1

Cycle | Int./Branch Slot | LD/ST Slot
1     |                  | ld 0(%rdi),%r9
2     | add $-4,%esi     | ld 4(%rdi),%r10
3     | add $5,%r9       | ld 8(%rdi),%r11
4     | add $5,%r10      | ld 12(%rdi),%r12
5     | add $5,%r11      | st %r9,0(%rdi)
6     | add $5,%r12      | st %r10,4(%rdi)
7     | add $16,%rdi     | st %r11,8(%rdi)
8     | jne $0,%esi,L1   | st %r12,-4(%rdi)

w/ Loop Unrolling and Register Renaming (notice how the compiler would have to modify the code to effectively reschedule)

IPC = 15 instrucs. / 8 cycles = 1.875


Data Dependency Hazards Summary

• RAW = Only real data dependency
– Must be respected in terms of code movement and ordering
– Forwarding reduces latency of dependent instructions

• WAW and WAR hazards = Name dependencies (WAR is an antidependence, WAW is an output dependence)
– Solved by using register renaming

• RAR = No issues / dependencies

Loop Unrolling / Register Renaming

• Loop unrolling increases code size (memory needed to store instructions)

• Register renaming burns more registers and thus may require HW designers to add more registers

• Must have some independence between loop bodies
– Dependence between iterations known as loop carried dependence

// Dependence between iterations
A[0] = 5;
for (i=1; i < MAX; i++) {
  A[i] = A[i-1] + 5;
}

Memory Hazard Issue

• Suppose %rsi and %rdi are passed in as arguments to a function. Can we reorder the instructions below?
– No, if %rsi == %rdi we have a data dependency in memory

• Data dependencies can occur via memory and are harder for the compiler to find at compile time, forcing it to be conservative

Original order (can we reorder these instructions?):

ld 0(%rsi),%edx
addl $5,%edx
st %edx,0(%rsi)
ld 0(%rdi),%eax
addl $1,%eax
st %eax,0(%rdi)

Reordered (we moved the 2nd ‘ld’ up to enhance performance):

ld 0(%rsi),%edx
ld 0(%rdi),%eax
addl $5,%edx
addl $1,%eax st %edx,0(%rsi)
st %eax,0(%rdi)

Is it equivalent? No… the 2nd ‘ld’ needs to wait for the first ‘st’!
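The effect of the hoisted load can be reproduced in C. The sketch below writes both orderings as ordinary functions, with `p` standing in for %rsi and `q` for %rdi; the function names `in_order` and `hoisted` are hypothetical. When the two pointers differ the results agree, and when they alias the hoisted version reads a stale value.

```c
#include <assert.h>

/* Program order: finish the first read-modify-write before the second. */
void in_order(int *p, int *q)
{
    assert(p && q);
    int edx = *p;  edx += 5;  *p = edx;   /* ld / addl $5 / st through %rsi */
    int eax = *q;  eax += 1;  *q = eax;   /* ld / addl $1 / st through %rdi */
}

/* Reordered: the second load is hoisted above the first store. */
void hoisted(int *p, int *q)
{
    assert(p && q);
    int edx = *p;
    int eax = *q;      /* moved up: stale if p == q */
    edx += 5;
    eax += 1;
    *p = edx;
    *q = eax;
}
```

Starting from a single variable x = 0 passed as both arguments, `in_order` leaves x = 6, whereas `hoisted` leaves x = 1 because the second load did not see the first store.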

Memory Disambiguation

• Data dependencies occur in memory and not just registers!

• Memory RAW dependencies are also made harder because of different ways of addressing the same memory location
– Can the following be reordered?

st %eax, 4(%rdi)
ld -12(%rsi), %ecx

– No! What if %rsi = %rdi + 16? Then both instructions access address %rdi + 4

• Memory disambiguation refers to the process of determining if a sequence of stores and loads references the same address (ordering often needs to be maintained)

• We can only reorder LD and ST instructions if we can disambiguate their addresses to determine any RAW, WAR, WAW hazards
– LD -> LD is always fine (RAR)
– ST -> LD (RAW), LD -> ST (WAR) and ST -> ST (WAW) are hazards that need to be disambiguated
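A minimal sketch of the address comparison behind disambiguation follows. It computes the effective address of a `disp(base)` operand and checks whether two accesses touch a common byte. The helper names `effective_address` and `accesses_overlap` are hypothetical, and the sketch assumes the register values are already known, which is exactly what a compiler usually lacks and hardware only learns at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Effective address of disp(base), as in `st %eax, 4(%rdi)`. */
uintptr_t effective_address(uintptr_t base, intptr_t disp)
{
    return base + (uintptr_t)disp;   /* wraps correctly for negative disp */
}

/* True if byte ranges [a, a+size_a) and [b, b+size_b) share any byte. */
bool accesses_overlap(uintptr_t a, unsigned size_a,
                      uintptr_t b, unsigned size_b)
{
    assert(size_a > 0 && size_b > 0);
    return a < b + size_b && b < a + size_a;
}
```

With %rsi = %rdi + 16, the store to 4(%rdi) and the load from -12(%rsi) resolve to the same address, so the pair has a RAW dependence through memory and must stay in order.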

Itanium 2 Case Study

• Max 6 instruction issues/clock

• 6 Integer Units, 4 Memory units, 3 Branch, 2 FP
– Although full utilization is rare

• Registers
– (128) 64-bit GPR’s
– (128) FPR’s

• On-chip L3 cache (12 MB of cache memory)

I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the "Itanium" approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.

Don Knuth
https://www.informit.com/articles/article.aspx?p=1193856

Static Multiple Issue Summary

• Compiler is in charge of reordering, renaming, unrolling original program code to achieve better performance

• Processor is designed to fetch/decode/execute multiple instructions per cycle in the order determined by the compiler

• Pros: HW can be simpler and thus faster/smaller
– More cores
– Potentially higher clock rates

• Cons: Requires recompilation
– No support for legacy software

Exercise: 2-way VLIW Scheduling

void f1(int *A, int *B, int N) {
  do {
    int temp = *A;
    *A = temp + *B + 9;
    *B = temp;
    A--; B--; N--;
  } while (N != 0);
}

.L1: ld (%rdi),%eax ; load temp=*A

ld (%rsi),%ebx ; load *B

add %eax,%ebx ; add temp+*B

add $9,%ebx ; add 9

st %ebx,(%rdi) ; store *A

st %eax,(%rsi) ; store *B

add $-4,%rdi ; dec. A ptr.

add $-4,%rsi ; dec. B ptr.

add $-1,%edx

jne $0,%edx,.L1 ; loop

=== INTEGER SLOT ===

// nop

add %eax,%ebx

add $9,%ebx

add $-4,%rdi

add $-4,%rsi

add $-1,%edx

jne $0,%edx,.L1

=== LD/ST SLOT ===

ld (%rdi),%eax

ld (%rsi),%ebx

st %ebx,(%rdi)

st %eax,(%rsi)

Unoptimized Schedule

Exercise 1: You can move or modify code, but cannot apply loop unrolling or register renaming.

Solution: 2-way VLIW Scheduling

=== INTEGER SLOT ===

// nop

add %eax,%ebx

add $9,%ebx

add $-4,%rdi

add $-4,%rsi

add $-1,%edx

jne $0,%edx,.L1

=== LD/ST SLOT ===

ld (%rdi),%eax

ld (%rsi),%ebx

st %ebx,(%rdi)

st %eax,(%rsi)

Unoptimized Schedule

=== INTEGER SLOT ===

add $-4,%rdi

add $-4,%rsi

add $-1,%edx

add %eax,%ebx

add $9,%ebx

jne $0,%edx,.L1

=== LD/ST SLOT ===

ld (%rdi),%eax

ld (%rsi),%ebx

st %eax,4(%rsi)

st %ebx,4(%rdi)

Move Up and Modify Offsets

IPC = 10 instructions / 6 clocks = 1.67

Note:
● Modified offset for st
● Intermediate instruction between load into %ebx and its use by add

Exercise 2: You can unroll the loop once (2 iterations) with register renaming.

Unrolling the loop with register renaming

ld (%rdi),%eax ; load temp=*A

ld (%rsi),%ebx ; load *B

add %eax,%ebx ; add temp+*B

add $9,%ebx ; add 9

st %ebx,(%rdi) ; store *A

st %eax,(%rsi) ; store *B

ld -4(%rdi),%r8d ; 2nd iter

ld -4(%rsi),%r9d ;

add %r8d,%r9d ;

add $9,%r9d ;

st %r9d,-4(%rdi) ;

st %r8d,-4(%rsi) ;

add $-8,%rdi ; dec. A ptr.

add $-8,%rsi ; dec. B ptr.

add $-2,%edx

jne $0,%edx,.L1 ; loop

Loop Unrolling / Register Renaming

=== INTEGER SLOT ===

add $-8,%rdi

add $-8,%rsi

add $-2,%edx

add %eax,%ebx

add $9,%ebx

add %r8d,%r9d

add $9,%r9d

jne $0,%edx,.L1

Move Up and Modify Offsets

IPC = 16 instructions / 8 clocks = 2

Note: intermediate instructions between loads and uses of a register.

=== LD/ST SLOT ===

ld (%rdi),%eax

ld (%rsi),%ebx

ld 4(%rdi),%r8d

ld 4(%rsi),%r9d

st %eax,8(%rsi)

st %ebx,8(%rdi)

st %r8d,4(%rsi)

st %r9d,4(%rdi)

[Figure annotations on the stores: Reversed %rsi / %rdi; Increased Offset]

Exercise: Solve WAR/WAW hazards

Solve WAR/WAW hazards of the following code through renaming.

ld 0(%rdi),%rax

add %rcx,%rax

sub %rbx,%rcx

ld 0(%rsi),%rbx

sub %rsi,%rbx

add %rbx,%rbx

Solution (the second use of %rbx is renamed to %r8):

ld 0(%rdi),%rax

add %rcx,%rax

sub %rbx,%rcx

ld 0(%rsi),%r8

sub %rsi,%r8

add %r8,%r8

STATIC SCHEDULING EXAMPLE

DYNAMIC MULTIPLE ISSUE MACHINES

HW-based solutions

Overcoming the Memory Latency

• What happens to instruction execution if we have a cache miss?
– All instructions behind us need to stall!
– Could take potentially hundreds of clock cycles to fetch the data

• Can we overcome this?

[Figure: the 2-way datapath (I-Cache, Reg. File with 4 Read and 2 Write ports, ALU, AddrCalc, D-Cache) with ld 0(%rdi),%rcx waiting at the D-Cache.]

Out-Of-Order Execution

• Idea: Have processor find dependencies as instructions are fetched/decoded and execute independent instructions that come after stalled instructions
– Known as Out-of-Order Execution or Dynamic Scheduling
– HW will determine the "dependency" graph at runtime and as long as an instruction isn't waiting for an earlier instruction, let it execute!

void f1(int *A, int n, int *s) { *s += 1; for( ; n != 0; n--, A++) *A += 5; }

# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld 0(%rdx),%r8
add $1,%r8
st %r8,0(%rdx)
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
add $4,%rdi
add $-1,%esi
jne $0,%esi,L1

[Figure: dependency graph. ld 0(%rdx),%r8 (CACHE MISS) -> add $1,%r8 -> st %r8,0(%rdx) is one chain of true (RAW or control) dependences; ld 0(%rdi),%r9 -> add $5,%r9 -> st %r9,0(%rdi) is an independent chain.]

OoO Execution: Tomasulo's Algorithm

Block Diagram Adapted from Prof. Michel Dubois (Simplified for CS356)

[Figure: I-Cache -> Instruc. Queue -> Dispatch -> Issue Unit queues -> functional units (Integer / Branch, D-Cache, Div, Mul) -> Common Data Bus, together with the Reg. File and the Register Status Table.]

• I-Cache / Instruc. Queue: Fetch multiple instructions per clock cycle in PROGRAM ORDER (i.e. normal order generated by the compiler)

• Dispatch: Decode & dispatch multiple instructions per cycle tracking dependencies on earlier instructions

• Issue Unit: Instructions wait in queues until their respective functional unit (the hardware that will compute their value) is free AND they have their data available (from the instructions they depend upon).

• Common Data Bus: Results of multiple instructions can be written back per cycle. Results are broadcast to any instruction waiting for that result.

• Register Status Table: Tracks which instruction is the latest producer of a register. (Helps track the dependencies)

Walkthrough (assume we can dispatch 3 instructions per cycle; [resN] stands for the result of instruction N)

Step 1: Instructions 1 to 3 are dispatched.
1:ld 0(%rdx),%r8
2:addl $1,[res1]
3:st [res2],0(%rdx)
Instructions wait in queues until their respective functional unit (the hardware that will compute their value) is free AND they have their data available (from the instructions they depend upon).

Step 2: Instructions 4 to 6 are dispatched while the #1 LD is STALLED.
4:ld 0(%rdi),%r9
5:addl $5,[res4]
6:st [res5],0(%rdi)

Step 3: #4 LD 0(%rdi),%r9 has all its data and thus can jump ahead of the other stalled LD and the ST that is dependent on that LD. It executes out of order.

Step 4: Once the #4 LD reads from cache (assume cache hit) it will broadcast its value (Result of [4: ld 0(%rdi),%r9]) and the dependent #5 ADD will pick it up and be able to execute. Meanwhile, the next instructions can be dispatched:
7: add $4,%rdi
8: add $-1,%esi

Step 5: Once the #5 ADD computes its sum it will broadcast its value and the dependent #6 ST will pick it up and be able to execute. Meanwhile other ADDs can execute and complete (Result of [7: addl $4,%rdi]).

Step 6 (MISS RESOLVED): When the cache finally resolves the miss, #1 LD will broadcast its value (Result of [1: ld 0(%rdx),%r8]) and the dependent #2 ADD will pick it up and be able to execute. From there the #3 ST will be able to execute next.

Dynamic Multiple Issue

• Burden of scheduling code for parallelism is placed on the HW and is performed as the program runs (not necessarily at compile time)
– Compiler can help by moving code, but HW guarantees correct operation no matter what

• Goal is for HW to determine data dependencies and let independent instructions execute even if previous instructions (dependent on something) are stalled
– We call this a form of Out-of-Order Execution

Problems with OoO Execution

• What if an exception (e.g. page fault) occurs in an earlier instruction AFTER later instructions have already completed?
– OS will save the state of the program and handle the page fault
– When the OS resumes the program, it will restart at the ST instruction
– The subsequent instructions will execute for a 2nd time. BAD!!!

• We need to fetch and dispatch multiple instructions per cycle but when we hit a jump/branch, we don't know which way to fetch!

• Solution: Speculative Execution with ability to "Rollback"

# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld 0(%rdx),%r8
add $1,%r8
st %r8,0(%rdx)
L1: ld 0(%rdi),%r9
add $5,%r9
st %r9,0(%rdi)
add $4,%rdi
add $-1,%esi
jne $0,%esi,L1
next instruc.

[Figure annotations: "Page Fault!" at the ST, later instructions marked "Completed", "Resume after STALLING", and "Where next" at the branch.]

SPECULATIVE EXECUTION

Speculation w/ Dynamic Scheduling

• Basic block size of 5-7 instructions between branches limits ability to issue/execute instructions
– For safety, we might consider stalling (stop dispatching instructions) until we know the outcome of a jump/branch

• But speculation allows us to predict a branch outcome and continue issuing and executing down that path

• Of course, we could be wrong so we need the ability to roll-back the instructions we should not have executed if we mispredict

• We add a structure known as the commit unit (or re-order buffer / ROB)

ld 0(%r8),%r9
and %r10,%r11
L1: add %r8,%r12
or %r11,%r13
sub %r14,%r10
jeq %r12,%r14,L1
xor %r10,%r15

[Figure annotations: the Basic Block from L1 to jeq; "Cache Miss"; "STALL until we know outcome" at the branch.]

Out-Of-Order Diagram

[Figure: out-of-order pipeline.
Front End (In-Order): Issue Logic (> 1 instruc. per clock).
Back End (Out-of-Order): queues for functional units feeding INT ALU, INT MUL/DIV, FP MUL/DIV and Load/Store; results forwarded to dependent (RAW) instructions.
Commit (In-Order): Re-Order Buffer (ROB) temporarily buffers results of instructions until earlier instructions complete and writeback IN ORDER (to D$ or registers). It holds completed instructions waiting to complete in-order, from Instruc. i = oldest to Instruc. i+n = newest.]

Commit Unit (ROB)

• ROB tail entry is allocated for each instruction on issue (ensuring instruction entries are maintained in program order)

• When an instruction completes, its result is forwarded to other instructions but is also stored in the ROB (rather than written back to the reg. file) until the instruction reaches the head of the ROB

• Commit unit only commits the instructions at the head of the queue and only when they are fully completed
– Ensures that everything committed was correct
– If we hit an exception or misspeculate/mispredict a branch then throw away every instruction behind it in the ROB and start fresh using the correct outcome
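The three ROB rules above (allocate at the tail on issue, commit only from the head, discard everything younger on a misprediction or exception) can be sketched as a small circular buffer. This is a toy model under stated assumptions: `ROB_SIZE`, the struct layout and all function names are hypothetical, and results, register write-back and re-fetching are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8   /* hypothetical capacity */

struct rob_entry { int id; bool done; };

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head;    /* slot of the oldest entry  */
    int count;   /* number of valid entries   */
};

/* Allocate a tail entry at issue time; returns its slot, or -1 if full. */
int rob_issue(struct rob *r, int id)
{
    if (r->count == ROB_SIZE)
        return -1;
    int slot = (r->head + r->count) % ROB_SIZE;
    r->e[slot].id = id;
    r->e[slot].done = false;
    r->count++;
    return slot;
}

/* Mark an instruction as finished executing (possibly out of order). */
void rob_complete(struct rob *r, int slot)
{
    assert(slot >= 0 && slot < ROB_SIZE);
    r->e[slot].done = true;
}

/* Commit finished instructions from the head only; returns how many. */
int rob_commit(struct rob *r)
{
    int n = 0;
    while (r->count > 0 && r->e[r->head].done) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}

/* Discard every entry younger than `slot` (mispredict or exception). */
void rob_flush_after(struct rob *r, int slot)
{
    int age = (slot - r->head + ROB_SIZE) % ROB_SIZE;  /* 0 = oldest */
    assert(age < r->count);
    r->count = age + 1;
}
```

In this model an instruction that finishes early simply waits with `done` set; nothing behind an unfinished head entry can commit, which is what keeps the architectural state in program order.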

[Figure: ROB contents LD, AND, ADD, SUB, followed by an entry marked "XOR / ADD ????". Legend: (Orange) executed instrucs. waiting to write back; (Gray) stalled or not yet executed.]

ld 0(%r8),%r9
and %r9,%r11
L1: add %r8,%r12
sub %r8,%r10
st %r9,0(%r13)
jeq %r9,%r14,L1
xor %r10,%r15

Commit Unit (ROB)

• What happens if the ST instruction that is STALLED ends up causing a page fault?
– ROB allows us to throw away instructions after it and replay them after the page fault is handled

• When we get to the JEQ, we don't know %r9 so we'll just guess (predict) the outcome and fetch down that predicted path
– If we mispredict, ROB allows us to throw away instruction results

ld 0(%r8),%r9
and %r9,%r11
L1: add %r8,%r12
sub %r8,%r10
st %r9,0(%r13)
jeq %r9,%r14,L1
xor %r10,%r15

[Figure: ROB contents LD, AND, ADD, SUB and "XOR / ADD ????", with annotations "Exception!" and "Mispredicted".]

Branch Prediction + Speculation

• To keep the backend fed with enough instructions we need to predict a branch's outcome and perform "speculative" execution beyond the predicted (unresolved) branch
– Roll back mechanism (flush) in case of misprediction

[Figure: tree of basic blocks separated by conditional branches, each with an NT-path and a T-path; speculative execution runs ahead of the head of the ROB.]

Speculation Example

• Predict branches and execute most likely path
– Simply flush ROB entries after the mispredicted branch
– Need good prediction capabilities to make this useful

[Figure: ROB snapshots over time, showing the speculative path and the correct path.
Time 1: ROB with red entries = predicted branches; ROB head (assume stall).
Time 2a: ROB with black entry = mispredicted branch.
Time 2b: Flush ROB/pipeline of instructions behind it.
Time 3: ROB/pipeline begins to fill w/ correct path.]

CS356: Meltdown & Spectre

Or... The Dangers of Caches and Speculation

Marco Paolieri (paolieri@usc.edu)

Illustrations from meltdownattack.com

Virtual memory isolates processes, right?

Programs are not permitted to read data from other programs… and yet:

● Meltdown allows a program to access the memory (and secrets) of other programs and the operating system. It works in the cloud too!

● Spectre allows an attacker to trick error-free programs, which follow best practices, into leaking their secrets. Safety checks of said best practices actually increase the attack surface and may make applications more susceptible to Spectre. Harder to exploit but also harder to mitigate.

Wait, but:
● Am I affected by the vulnerability? Most certainly, yes.
● Where’s the problem? Out-of-order execution (Intel since 1995, some ARM)
● Can I detect if someone has exploited them against me? Probably not.
● Can my antivirus detect attackers? Unlikely until a binary becomes known.
● What can be leaked? Anything in your RAM.
● Is there a fix? OS patches against Meltdown, compiler patches for Spectre.

In practice: reading other programs’ memory

Meltdown: Melts HW Security Boundaries

Isolation between kernel and user processes
● Supervisor bit in the CPU is set when entering kernel code.
● Kernel memory pages can be read only when the supervisor bit is set.
● No change of memory mapping between kernel mode and user mode.

● Meltdown uses side-channel information leaked by out-of-order execution to read kernel virtual memory.

● In the kernel virtual memory, there is a region that gives access to all of the physical RAM.

● It’s not a bug of the OS, but a problem of modern CPUs. All operating systems are affected.

So, what’s the problem with out-of-order?

Out-of-Order and Speculative Execution

Out of Order. While the execution unit (ALU, LD, ST) of the current instruction is busy, other instructions can run ahead if resources and data are available.

Speculative Execution. Let the CPU process instructions even when it’s not certain that they will be needed. If they are not needed, just undo their effects (rollback); otherwise commit.● Example: instructions after a branch

How? Tomasulo’s algorithm (1967)
● Queue instructions in scheduler
● Connect all units with CDB
● Run them when data is available
● Rollback if order was wrong

Cache Side-Channel Attacks: Flush+Reload

Flush+Reload
1. Attacker frequently flushes the shared last-level cache (clflush)
2. Checking the time to read from an address, the attacker determines whether data within the same line was read by the user.

Variant: Evict+Reload. Instead of clflush, force eviction through many reads.

Covert Channel
The attacker flushes the last-level cache and tries to access an address. Why? To let some information leak from one security domain to another.

An example: guessing the value of data

● After the exception, the flow continues in the kernel.
● The exception handler terminates the program.
● But due to out-of-order execution, the array access is performed… and rolled back later when the exception happens.

But the effects on the cache are still present… a cache line was read.

Guessing values of kernel data

Transient Instructions
Completed but not committed (they introduce a side channel)

● Accessing kernel data from user-space causes an exception.

● If we have instructions that use that data, they will certainly be rolled back.

● Can we leave a trace of the data in the state of the microarchitecture? Like, in the cache?

Avoiding the exception
The exception will terminate your attack program. Solutions:
● Do the illegal access from a child process, recover secret data from parent
● Trick the branch predictor into thinking that the illegal code should run, when it won’t: this schedules your code, without an exception.

Sending data through the covert channel

Very simple!

● Fix a monitored memory address x
● The transient instruction reads kernel data and transmits it 1 bit at a time
  ○ Flush the cache; if the bit is 1, read from x
● The recovery code (in user space) reads from x
  ○ If the read is fast, the value was 1

Using many addresses in different cache lines, we can transmit multiple bits at a time.

Meltdown can transmit data at 500 kB/s. Not bad!

Kernel Address Space Layout Randomization
Same as ASLR, but instead of randomizing the stack position, randomize the position of the kernel virtual memory.

Enabled by default in Linux 4.12

The location of the physical memory range is also randomized…
● but only by 40 bits
● it takes 128 tests to cover 8 GB

So, it can be circumvented.

Linux KAISER Patch

Stronger isolation between userspace and kernel space
Don’t map kernel data in user-space virtual memory (except for a few things like interrupt handlers)

Now KASLR is useful
● There is little kernel data
● It’s hard to guess where it is

Meltdown doesn’t work on AMD
Probably because access bits are checked in the page table before LD/ST

Out of Order Execution + Side Channels

4-Step Recipe for a Meltdown

1. Vulnerable out-of-order CPU allows an unprivileged process to load data from a privileged address (kernel and all physical RAM) into a register.

2. The CPU performs computations based on the value of this register (e.g., use the register as array index to read from memory).

3. If it turns out that we didn’t have the right to read the data, the results of these operations are thrown away.

4. Seems safe but… out-of-order memory lookups influence cache hit/miss: from this side channel, we can find out what the value was in the register.

Read the original paper!
https://arxiv.org/abs/1801.01207

Spectre: The Risks of Speculative Execution

1. Locate or introduce a sequence of instructions within the process address space which, when executed, act as a covert channel transmitter leaking the victim’s memory or register contents.

2. Trick the CPU into speculatively and erroneously executing the instructions.

3. Retrieve information from covert channel.

How to trick the CPU into wrong speculations?

● Mistrain the conditional branch predictor.
  For example: if there’s an if guard, invoke it many times with valid data. Then pass invalid data… the branch predictor will take the “is valid” branch

● Mistrain the indirect branch predictor.
  For example: the attacker jumps many times to the virtual address of a gadget in the attacked program. The attacked program will jump too...

Spectre with branch misprediction

Idea
● Suppose that we can choose x
● First use many valid x values
● Then, pick x out of bounds to read secret data of attacked program
● If array1_size, array2 not cached: predictor will make a wrong guess
● Speculative execution causes a read on page number equal to secret
● We can scan all the pages and check what the secret was from access times
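The slide's code figure did not survive extraction, so here is a sketch of the kind of bounds-checked read the bullets describe. The names `array1`, `array1_size`, and `array2` come from the bullets; the `victim` function, the array contents, and the 4096-byte stride (one page per possible byte value) are assumptions for illustration. It shows only the code shape: there is no predictor training, cache flushing, or timing here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE 4096   /* stride: one page per possible byte value (assumption) */

uint8_t array1[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
size_t  array1_size = 16;
uint8_t array2[256 * PAGE];   /* probe array that would be scanned afterwards */

static_assert(sizeof(array2) == 256 * PAGE, "one page per byte value");

/* Architecturally this never reads array1 out of bounds. */
uint8_t victim(size_t x) {
    uint8_t y = 0;
    if (x < array1_size) {               /* the guard trained with valid x values */
        y = array2[array1[x] * PAGE];    /* touches the page numbered array1[x] */
    }
    return y;
}
```

If the guard is mispredicted for an out-of-bounds `x`, the inner read can run speculatively with `array1[x]` being a secret byte, which leaves the page with that number in the cache.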

It works in the browser too!

● Cannot use clflush for Flush+Reload, must use Evict+Reload
● The clflush avoids optimizations, |0 triggers optimizations
● The secret n will be used to access probeTable[n*4096]
● Timing errors are possible due to other processes, but error rate is 0.005%

Processors at risk

Most
● Intel
  ○ Ivy Bridge
  ○ Haswell
  ○ Broadwell
  ○ Skylake
  ○ Kaby Lake
● AMD Ryzen
● ARM-based Samsung and Qualcomm processors

(And it works with native code or even sandboxed Javascript.)

Difference with respect to Meltdown
● Meltdown does not mistrain branch prediction
● Meltdown bypasses memory protection between userspace and kernel
● Spectre does not: it just makes the CPU jump to the wrong place and run instructions there to leak information. Cannot be fixed with KAISER.

Countermeasures

Preventing speculative execution
Very expensive!

Preventing access to secret data
● Run websites in separate processes
● Replace array bounds checking with index masking to ensure locality
● Pointer poisoning: (pointer XOR secret)

Only with the secret can you recover/follow the pointer, and branch predictions use different statistics.

Preventing data from being leaked through caches
Not possible with current processors
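Two of the countermeasures listed above, index masking and pointer poisoning, can be sketched in a few lines. This is a minimal illustration, not how any particular browser implements them; `TABLE_SIZE`, `masked_read`, `poison`, and `unpoison` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16u   /* must be a power of two for the mask to work */

static_assert((TABLE_SIZE & (TABLE_SIZE - 1u)) == 0, "TABLE_SIZE must be a power of two");

static uint8_t table[TABLE_SIZE];

/* Index masking: instead of branching on a bounds check, force the index
   into range with a mask, so even a speculative access stays in the table. */
uint8_t masked_read(size_t index) {
    return table[index & (TABLE_SIZE - 1u)];
}

/* Pointer poisoning: store (pointer XOR secret); only code that knows the
   secret can turn the stored value back into a usable pointer. */
uintptr_t poison(const void *p, uintptr_t secret) {
    return (uintptr_t)p ^ secret;
}

void *unpoison(uintptr_t poisoned, uintptr_t secret) {
    return (void *)(poisoned ^ secret);
}
```

The mask trades an exact bounds check for a guarantee that holds regardless of what the branch predictor guesses, because there is no branch left to mispredict.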

Read the original paper! https://arxiv.org/abs/1801.01203

BONUS MATERIAL
Not responsible for this material

BRANCH PREDICTION

Branch Prediction

• Since basic blocks are often small, multiple issue (static or dynamic) processors may encounter a branch every 1-2 cycles

• We not only need to know the outcome of the branch but the target of the branch
  – Branch target: Branch target buffer (cache)
  – Branch outcome: Static (compile-time) or dynamic (run-time / HW assisted) prediction techniques

• To keep the pipeline full and make speculation efficient and not wasteful, the processor needs accurate predictions

Branch Target Availability

• Branches perform PC = PC + displacement where displacement is stored as part of the instruction

• Usually can’t get the target until after the instruction is completely fetched (displacement is part of instruction)
  – May be 2-3 cycles in a deeply pipelined processor (ex. [I-TLB, I-Cache Lookup, I-Cache Access, Decode])
  – With a 4-way superscalar and a 3-cycle branch penalty, we throw away 4 × 3 = 12 instructions on a misprediction

• Key observation: Branches always branch to the same place (target is constant for each branch)

[Figure: front-end pipeline stages PC → I-TLB Access → I-$ Look-up → I-$ Access → Decode, annotated with "Branch instruction still being fetched", "Branch instruction available here", "Branch target (PC+d) available here", and "Would like to have branch target available in 1st stage".]

Finding the Branch Target

• Key observation: Jump/branches always branch to same place (target is constant for each branch)
  – The first time we fetch a jump we'll have no idea where it is going to jump to and thus have to wait several cycles
  – But let's save the address where the jump instruction lives AND where it wants to jump to (i.e. the target)
  – Next time the PC gets to the starting address of the jump we can lookup the target quickly
  – Keep all this info in a small "branch target cache/buffer"

00000000004004d6 <sum>:
  4004d6: 85 f6                   test   %esi,%esi
  4004d8: 7e 1d                   jle    4004f7 <sum+0x21>
  4004da: 48 89 fa                mov    %rdi,%rdx
  4004dd: 8d 46 ff                lea    -0x1(%rsi),%eax
  4004e0: 48 8d 4c 87 04          lea    0x4(%rdi,%rax,4),%rcx
  4004e5: b8 00 00 00 00          mov    $0x0,%eax
  4004ea: 03 02                   add    (%rdx),%eax
  4004ec: 48 83 c2 04             add    $0x4,%rdx
  4004f0: 48 39 ca                cmp    %rcx,%rdx
  4004f3: 75 f5                   jne    4004ea <sum+0x14>
  4004f5: eb 05                   jmp    4004fc <sum+0x26>
  4004f7: b8 00 00 00 00          mov    $0x0,%eax
  4004fc: c6 05 36 0b 20 00 01    movb   $0x1,0x200b36(%rip)
  400503: c3                      retq

Jump Instruc. Address    Jump Target
0x4004d8                 0x4004f7
0x4004f3                 0x4004ea
0x4004f5                 0x4004fc

Branch Target Buffer

• Idea: Keep a cache (branch target buffer / BTB) of branch targets that can be accessed using the PC in the 1st stage
  – Cache holds target addresses and is accessed using the PC (address of instruction)
  – First time a branch is executed, cache will miss, and we’ll take the branch penalty but save its target address in the cache
  – Subsequent accesses will hit (until evicted) in the BTB and we can use that target if we predict the branch is taken.

• Note: BTB is a “fully-associative” cache (search all entries for PC match)…thus it can’t be very large
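The BTB behavior described above (miss on first sight, hit afterwards, search every entry) can be sketched as a small software model. This is an illustration under assumptions: the entry count, the round-robin replacement policy, and the names `btb_lookup` and `btb_update` are all hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES 4   /* tiny on purpose; real sizes are not given in the slides */

typedef struct {
    bool     valid;
    uint64_t branch_pc;   /* address of the jump instruction (the tag) */
    uint64_t target;      /* where that jump went last time */
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];
static size_t   next_victim;   /* round-robin replacement (an assumption) */

/* Fully associative lookup: compare the PC against every entry. */
bool btb_lookup(uint64_t pc, uint64_t *target) {
    assert(target != NULL);
    for (size_t i = 0; i < BTB_ENTRIES; i++) {
        if (btb[i].valid && btb[i].branch_pc == pc) {
            *target = btb[i].target;
            return true;    /* hit: target is known in the 1st stage */
        }
    }
    return false;           /* miss: wait for the fetched instruction */
}

/* Called once the branch has resolved and its target is known. */
void btb_update(uint64_t pc, uint64_t target) {
    for (size_t i = 0; i < BTB_ENTRIES; i++) {
        if (btb[i].valid && btb[i].branch_pc == pc) {
            btb[i].target = target;
            return;
        }
    }
    btb[next_victim] = (BtbEntry){ true, pc, target };
    next_victim = (next_victim + 1) % BTB_ENTRIES;
}
```

Feeding it the three jumps of `sum` (0x4004d8, 0x4004f3, 0x4004f5) reproduces the address/target table from the earlier slide; the linear search over all entries is the software analogue of why a fully associative BTB cannot be very large.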

[Figure: in the 1st stage the PC is sent to both the I-TLB access and the BTB, which supplies the target PC.]

Branch Outcome Prediction

• Now that we have predicted the target, we need to predict the outcome

• Static prediction
  – Have compiler make a fixed guess and put that as a "hint" in the instruction itself
  – Effective for loops

• Dynamic prediction
  – Some jumps are data dependent (e.g. if(x < y))
  – Keep some "history"/records of each branch's outcomes from the past & use that to predict the future
  – Store that history in a cache
  – Questions:
    • What history should we use to predict a branch?
    • How much history should we use/keep to predict a branch?
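One common textbook answer to "how much history" is a 2-bit saturating counter per branch. The slides do not name this scheme, so treat the sketch below as an assumption: the table size, the counter encoding, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 4
#define TABLE_SIZE (1u << TABLE_BITS)

/* One 2-bit counter per entry: 0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t counters[TABLE_SIZE];

static unsigned index_of(uint64_t pc) {
    return (unsigned)(pc & (TABLE_SIZE - 1u));   /* low bits of the PC pick the entry */
}

bool predict_taken(uint64_t pc) {
    return counters[index_of(pc)] >= 2;
}

/* Train with the real outcome, saturating at 0 and 3. */
void train(uint64_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Run one branch through a sequence of outcomes; return the misprediction count. */
int count_mispredictions(uint64_t pc, const bool *outcomes, int n) {
    assert(outcomes != NULL && n >= 0);
    int wrong = 0;
    for (int i = 0; i < n; i++) {
        if (predict_taken(pc) != outcomes[i]) wrong++;
        train(pc, outcomes[i]);
    }
    return wrong;
}
```

For a loop branch that is taken nine times and then falls through, the first pass through this model mispredicts while the counter warms up, and later passes mispredict only the loop exit, which is why local history tends to work well for loop conditions.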

[Figure: loop control flow (Loop Body → Loop Check → “After” Code) next to the front end, where the PC feeds the I-TLB access, the BTB (target PC = PC+d), and the predictors (outcome T/NT).]

Local vs. Global History

• What history should we look at?
  – Should we look at just the previous executions of only the particular branch we’re currently predicting or at surrounding branches as well

• Local History: The previous outcomes of that branch only
  – Usually good for loop conditions

• Global History: The previous outcomes of the last m branches in time (other previous branches)

do {
  i--;
  if(x == 2) { … }
  if(y == 2) { … }
  if(x == y) { … }   // Better: Local or Global?
} while (i > 0);     // Better: Local or Global?

Global (Correlating) Predictor

• Use the outcomes of the last m branches that were executed to select a prediction
  – Given last m jumps, 2^m possible combinations of outcomes & thus predictions
    • When jeq1=NT and jeq2=NT, predict jne = T; when jeq1=NT and jeq2=T, predict jne = NT; etc.
  – Branch predictor indexed by concatenating LSBs of PC and m bits of last m branch outcomes

  je  ...,L1     (T = 1)
  ...
  je  ...,L2     (NT = 0)
  ...
  jne ...,L3     // use history of NT,T to predict
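The indexing scheme above (LSBs of the PC concatenated with the last m outcomes) can be sketched as follows. The bit widths, the use of 2-bit counters as table entries, and the function names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PC_BITS   4                        /* LSBs of the PC used in the index (assumption) */
#define HIST_BITS 2                        /* m = 2: outcomes of the last two branches */
#define ENTRIES   (1u << (PC_BITS + HIST_BITS))

static uint8_t  counters[ENTRIES];         /* 2-bit counters; >= 2 means predict taken */
static unsigned history;                   /* m-bit shift register, newest outcome in bit 0 */

/* Index = {PC LSBs, history}: the concatenation described above. */
static unsigned index_of(uint64_t pc) {
    unsigned pc_lsbs = (unsigned)(pc & ((1u << PC_BITS) - 1u));
    return (pc_lsbs << HIST_BITS) | history;
}

bool global_predict(uint64_t pc) {
    return counters[index_of(pc)] >= 2;
}

/* Train the selected counter, then shift the outcome into the global history. */
void global_update(uint64_t pc, bool taken) {
    unsigned i = index_of(pc);
    assert(i < ENTRIES);
    if (taken) { if (counters[i] < 3) counters[i]++; }
    else       { if (counters[i] > 0) counters[i]--; }
    history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << HIST_BITS) - 1u);
}
```

Because each branch gets 2^m counters, one per history pattern, a branch like `if(x == y)` whose outcome depends on the two preceding comparisons can be learned separately for each combination of their outcomes.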

[Figure: LSBs of the PC are concatenated with the m-bit branch history to index the predictors, alongside the I-TLB access.]

Tournament Predictor

• Dynamically selects when to use the global vs. local predictor
  – Accuracy of global vs. local predictor for a branch may vary for different branches
  – Tournament predictor keeps the history of both predictors (global or local) for a branch and then selects the one that is currently the most accurate
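One way to track "currently the most accurate" is a small selector counter per branch that moves toward whichever component predictor was right when the two disagree. The slides do not specify the mechanism, so this is a sketch under assumptions; the encoding, table size, and names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEL_ENTRIES 16u

static_assert((SEL_ENTRIES & (SEL_ENTRIES - 1u)) == 0, "SEL_ENTRIES must be a power of two");

/* 2-bit selector per entry: 0,1 = trust local; 2,3 = trust global (assumed encoding). */
static uint8_t selector[SEL_ENTRIES];

static unsigned sel_index(uint64_t pc) {
    return (unsigned)(pc & (SEL_ENTRIES - 1u));
}

/* Pick whichever component prediction the selector currently favors. */
bool tournament_predict(uint64_t pc, bool local_pred, bool global_pred) {
    return selector[sel_index(pc)] >= 2 ? global_pred : local_pred;
}

/* After the branch resolves, move toward the component that was right.
   If both were right or both were wrong, the selector is left unchanged. */
void tournament_update(uint64_t pc, bool local_pred, bool global_pred, bool taken) {
    uint8_t *s = &selector[sel_index(pc)];
    bool local_ok  = (local_pred == taken);
    bool global_ok = (global_pred == taken);
    if (global_ok && !local_ok && *s < 3) (*s)++;
    if (local_ok && !global_ok && *s > 0) (*s)--;
}
```

In this model the local and global predictions are passed in as arguments, so the selector can be combined with any pair of component predictors.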

[Figure: a tournament selector receives the local prediction and the global prediction and outputs the one from the predictor exhibiting the greatest accuracy.]

Dynamic Scheduling Summary

• You can understand a modern architecture
  – https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)

• Software implications
  – Code with a lot of branches will perform worse than regular code
  – Many cache misses will limit the performance

• But compared to a statically scheduled processor…

Pros/Cons of Static vs. Dynamic

• Static
  – HW can be simpler since compiler schedules code
  – Compiler can see “deeper” in the code to find parallelism
  – Used in many high-performance embedded processors like GPUs, etc. where code is more regular with high computation demand

• Dynamic
  – Allows for performance increase even for legacy software
  – Can be better at predicting unpredictable branches
  – HW structures (ROB, reservation stations, etc.) do not scale well beyond small sizes and waste more time & power
  – Better for unpredictable, general purpose control code

PHYSICAL VS ARCHITECTURAL REGISTERS

Virtual Registers?

• In static scheduling, the compiler accomplished register renaming by changing the instruction to use other programmer-visible GPRs

• In dynamic scheduling, the HW can "rename" registers on the fly
  – In the code below we would want %r9 to be renamed to %r10, %r11, for each iteration

• Solution: A level of indirection
  – Let the register numbers be "virtual" and then perform translation to a "physical" register
  – Every time we write to the same register we are creating a "new version" of the register…so let's just allocate a physically different register

# %rdi = A
# %esi = n = # of iterations
# %rdx = s
f1: ld   0(%rdx),%r8
    addl $1,%r8
    st   %r8,0(%rdx)
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1
L1: ld   0(%rdi),%r9
    add  $5,%r9
    st   %r9,0(%rdi)
    add  $4,%rdi
    add  $-1,%esi
    jne  $0,%esi,L1

Trace of instructions over 3 loop iterations. Each iteration is independent if we can rename %r9

Register Renaming

• Whenever an instruction produces a new value for a register, allocate a new physical register and update the table
  – Mark the old physical register as "free"
  – Mark the newly allocated register as "used"

• An instruction that wants to read a register just uses whatever physical register the current mapping table indicates
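The mapping-table bookkeeping above can be sketched in software. This is a simplified model under stated assumptions: the pool size, the FIFO free list, and the function names are hypothetical, and the old physical register is released immediately as the bullets say, whereas real hardware has to delay that until no in-flight instruction can still read it.

```c
#include <assert.h>

#define NUM_ARCH 16   /* architectural (programmer-visible) registers */
#define NUM_PHYS 32   /* physical pool size; illustrative only */

static int map_table[NUM_ARCH];   /* architectural -> physical mapping */
static int free_q[NUM_PHYS];      /* FIFO of free physical registers */
static int q_head, q_len;

static void push_free(int p) {
    free_q[(q_head + q_len) % NUM_PHYS] = p;
    q_len++;
}

static int pop_free(void) {
    int p = free_q[q_head];
    q_head = (q_head + 1) % NUM_PHYS;
    q_len--;
    return p;
}

/* Start with architectural register i mapped to physical register i. */
void rename_init(void) {
    q_head = 0;
    q_len = 0;
    for (int a = 0; a < NUM_ARCH; a++) map_table[a] = a;
    for (int p = NUM_ARCH; p < NUM_PHYS; p++) push_free(p);
}

/* A source operand just reads the current mapping. */
int rename_source(int arch) {
    assert(arch >= 0 && arch < NUM_ARCH);
    return map_table[arch];
}

/* A destination gets a fresh physical register; the old one is marked free.
   Returns -1 if no physical register is available. */
int rename_dest(int arch) {
    assert(arch >= 0 && arch < NUM_ARCH);
    if (q_len == 0) return -1;
    int fresh = pop_free();
    push_free(map_table[arch]);
    map_table[arch] = fresh;
    return fresh;
}
```

Renaming the destination `%r9` once per loop iteration hands out a different physical register each time, which is what makes the three iterations in the trace independent of one another.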

[Figure: snapshots of the mapping table (%rdi, %rsi, %r9, ...) and the physical register pool (each register marked free or used) as the trace ld 0(%rdi),%r9 / add $5,%r9 / add $4,%rdi / add $-1,%esi / … / ld 0(%rdi),%r9 executes. Physical registers hold values labeled orig rdi, orig rsi, ld's r9, add's r9, add's rdi, and add's esi.]

Architectural vs. Physical Registers

• Architectural registers = The (16) x86 registers visible to the programmer or compiler

– Truly just names ("virtual")

– The mapping table needs 1 entry per architectural register

• Physical registers = A greater number of actual registers than architectural registers that is used as a “pool” for renaming

• Often a large pool of physical registers (80-128) to support large number of instructions executing at once or waiting in the commit unit

[Figure: the register renaming snapshots again, with the mapping table labeled "Architectural Registers" and the register pool labeled "Physical Registers".]
