Instruction-Level Parallelism: compiler optimization techniques. Anna University, Dr. K. Thirunadana Sikamani. Prepared from Compilers by Aho and Ullman.
Compiler Optimization Techniques
CP 7031
Dr.K.Thirunadana Sikamani
Principal Sources of Optimization
Elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence that does the same thing, is usually called "code improvement" or "code optimization".
Redundancy
Semantics-preserving transformations
Global common subexpressions
Copy propagation
Dead-code elimination
Code motion
8/25/2014 Compiler Optimization Techniques - unit II
The speed of a program run on a processor with instruction-level parallelism depends on:
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.
Processor Architecture
1. Instruction Pipelines and Branch delays
2. Pipelined Execution
3. Multiple Instruction Issues –VLIW ( Very Long Instruction Word)
Code Scheduling Constraints
1. Control-dependence constraints
2. Data-dependence Constraints
3. Resource Constraints
Control-Dependence Constraints
All the operations executed in the original program must be executed in the optimized one.
Data-Dependence Constraints
The operations in the optimized program must produce the same results as the corresponding ones in the original program.
Resource Constraints
The schedule must not oversubscribe the resources of the machine.
Data Dependence
True dependence - Read after Write
Antidependence - Write after Read
Output dependence - Write after Write
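The three kinds of dependence above can be checked mechanically. A minimal Python sketch (the representation and names are my own, not from the slides): each statement is a pair of (variables read, variable written), and a later statement is compared against an earlier one.

```python
# Classify the dependence of a later statement t on an earlier statement s.
# Each statement is (set_of_variables_read, variable_written).
def classify(s, t):
    s_reads, s_writes = s
    t_reads, t_writes = t
    kinds = []
    if s_writes in t_reads:
        kinds.append("true (RAW)")      # t reads what s wrote
    if t_writes in s_reads:
        kinds.append("anti (WAR)")      # t overwrites what s read
    if t_writes == s_writes:
        kinds.append("output (WAW)")    # both write the same location
    return kinds

# The six statements from the exercise below: a=b, c=d, b=c, d=a, c=d, a=b
stmts = [({"b"}, "a"), ({"d"}, "c"), ({"c"}, "b"),
         ({"a"}, "d"), ({"d"}, "c"), ({"b"}, "a")]
```

Running `classify` on the exercise pairs reproduces the expected answers: statements 1 and 4 form a true dependence, 3 and 5 an antidependence, 1 and 6 an output dependence.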
Classify the dependences for the following statements:
1. a = b
2. c = d
3. b = c
4. d = a
5. c = d
6. a = b
Answers: statements 1 and 4 (true dependence), 3 and 5 (antidependence), 1 and 6 (output dependence).
Check the dependences for the following: give the register-level machine code that provides maximum parallelism, and also give the solution with minimal usage of registers, for the expression ((u+v) + (w+x)) + (y+z).
Minimal-register code:
LD r1, u
LD r2, v
ADD r1, r1, r2
LD r2, w
LD r3, x
ADD r2, r2, r3
ADD r1, r1, r2
LD r2, y
LD r3, z
ADD r2, r2, r3
ADD r1, r1, r2
Clock 1
LD r1, u
LD r2, v
LD r3, w
LD r4, x
LD r5, y
LD r6, z
Clock 2
ADD r1, r1, r2
ADD r3, r3, r4
ADD r5, r5, r6
Clock 3
ADD r1, r1, r3
Clock 4
ADD r1, r1, r5
Implementation of maximum parallelism in 4 clocks.
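The 4-clock result can be verified with a small clock-accounting sketch (a simplification of my own, assuming unlimited functional units and single-clock loads and adds, as in the schedule above):

```python
# Each value is tagged with the clock at which it becomes ready.
def ready_after_load(_name):
    return 1                        # all six loads issue together in clock 1

def add(t_left, t_right):
    # An add can issue only once both inputs are ready; it takes 1 clock.
    return max(t_left, t_right) + 1

u = v = w = x = y = z = ready_after_load("leaf")
t1 = add(u, v)        # ready after clock 2
t2 = add(w, x)        # ready after clock 2
t3 = add(y, z)        # ready after clock 2
t4 = add(t1, t2)      # ready after clock 3
total = add(t4, t3)   # ready after clock 4
```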
Finding Dependences Among Memory Accesses
1. Array data-dependence analysis
for (i = 0; i < n; i++)
    A[2*i] = A[2*i + 1];
2. Pointer alias analysis - two pointers are aliased if they refer to the same object.
3. Interprocedural analysis - determines whether the same variable is passed as two or more different arguments, in a language that passes parameters by reference.
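In the array loop above, the write A[2*i] touches only even indices and the read A[2*i+1] only odd ones, so no iteration touches a location another iteration uses. A quick illustrative Python check runs the loop in two different orders and compares the results:

```python
# The loop body A[2*i] = A[2*i+1] writes even slots and reads odd slots,
# so iterations are independent and any execution order gives the same array.
def run(order, a):
    a = list(a)
    for i in order:
        a[2 * i] = a[2 * i + 1]
    return a

n = 8
init = list(range(2 * n))
forward = run(range(n), init)
backward = run(reversed(range(n)), init)
```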
Tradeoff Between Register Usage and Parallelism
E.g., machine-independent intermediate-representation code:
LD t1, a
ST b, t1
LD t2, c
ST d, t2
The code above copies the values of a and c to b and d. If all memory locations are distinct, the two copies can proceed in parallel. If, instead, t1 and t2 are assigned the same register to minimize register usage, the second copy cannot start until the first has finished.
Tradeoff Between Register Usage and Parallelism
The syntax tree for (a + b) + c + (d + e):
[Figure: syntax tree of + nodes over the leaves a, b, c, d, e]
Machine code (minimal registers):
LD r1, a
LD r2, b
ADD r1, r1, r2
LD r2, c
ADD r1, r1, r2
LD r2, d
LD r3, e
ADD r2, r2, r3
ADD r1, r1, r2
Parallel evaluation of the expression:
Clock 1: r1 = a   r2 = b   r3 = c   r4 = d   r5 = e
Clock 2: r6 = r1 + r2   r7 = r4 + r5
Clock 3: r8 = r6 + r3
Clock 4: r9 = r8 + r7
Phase Ordering Between Register Allocation and Code Scheduling
If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling.
The other way around, the schedule created may require so many registers that register spilling becomes necessary.
Spilling - storing the contents of a register in a memory location, so the register can be used for some other purpose.
Which ordering is better depends on the characteristics of the program, e.g., numeric, non-numeric, etc.
Control Dependence
if (c) s1; else s2;   /* s1 and s2 are control dependent on c */
while (c) s;          /* s is control dependent on c */
if (a > t)
    b = a * a;
d = a + c;            /* d = a + c is not control dependent on the test */
Speculative Execution Support
Prefetching - bringing data from memory to the cache before it is used.
Poison bits - support speculative loads of data from memory into the register file. Each register is augmented with a poison bit; the bit is set when an illegal memory address is accessed, so that an exception is raised only at a later use of the value.
Predicated Execution
Predicated instructions were invented to reduce the number of branches in a program.
A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution.
E.g., CMOVZ R2, R3, R1 has the semantics of moving the contents of R3 to R2 if R1 is zero.
if (a == 0) b = c + d; can be implemented as
ADD R3, R4, R5     /* a, b, c, d are allotted R1, R2, R4, R5 */
CMOVZ R2, R3, R1
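A sketch of this if-conversion in Python, modeling registers as a dictionary; the register assignment follows the slide, while the function names are hypothetical:

```python
# CMOVZ semantics: move src into dst if the predicate register is zero.
def cmovz(regs, dst, src, pred):
    if regs[pred] == 0:
        regs[dst] = regs[src]

# if (a == 0) b = c + d; with a, b, c, d allotted R1, R2, R4, R5.
def if_conversion(a, b, c, d):
    regs = {"R1": a, "R2": b, "R4": c, "R5": d}
    regs["R3"] = regs["R4"] + regs["R5"]   # ADD R3, R4, R5 (always executes)
    cmovz(regs, "R2", "R3", "R1")          # CMOVZ R2, R3, R1
    return regs["R2"]                      # the final value of b
```

The ADD executes unconditionally, so there is no branch; only the move into b is guarded by the predicate.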
Basic Machine Model
Many machines can be represented as
M = <R, T>
T - a set of operation types, such as loads, stores, and arithmetic operations.
R - a vector R = [r1, r2, ...] of hardware resources;
ri is the number of units available of the ith kind of resource.
Resources: memory-access units, ALUs, floating-point functional units.
Basic Machine Model
Each operation has a set of input operands, a set of output operands, and a resource requirement.
RTt - the resource-reservation table for operation type t.
RTt[i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
Basic-Block Scheduling
Data-Dependence Graphs
Graph G = (N, E)
N - a set of nodes representing the operations in the machine instructions.
E - a set of directed edges representing the data-dependence constraints among the operations.
1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.
2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.
Data-Dependence Graph
i1: LD R2, 0(R1)
i2: ST 4(R1), R2
i3: ADD R3, R3, R2
i4: ADD R3, R3, R4
i5: LD R3, 8(R1)
i6: ST 0(R7), R7
i7: ST 12(R1), R3
[Figure: the data-dependence graph over i1-i7; edges out of loads carry a delay of 2, all other edges a delay of 1.]
1. A load operation takes 2 clock cycles.
2. R1 is a stack pointer, with offsets from 0 to 12.
List Scheduling of Basic Blocks
This involves visiting each node of the data-dependence graph in "prioritized topological order".
Machine-resource vector R = [r1, r2, r3, ...]
ri - the number of units available of the ith kind of resource
G = (N, E) - the data-dependence graph
RTn - the resource-reservation table of node n
An edge e = n1 -> n2 with delay de indicates that n2 may be issued only de clocks after n1 is issued.
List Scheduling Algorithm
RT = an empty reservation table;
for (each n in N in prioritized topological order) {
    s = max over edges e = p -> n in E of (S(p) + de);
        /* the earliest time this instruction could begin,
           given when its predecessors started */
    while (there exists i such that RT[s + i] + RTn[i] > R)
        s = s + 1;
        /* delay the instruction further until the needed
           resources are available */
    S(n) = s;
    for (all i)
        RT[s + i] = RT[s + i] + RTn[i];
}
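A minimal Python rendering of the algorithm above, simplified so that each operation occupies one unit of one resource kind for a single clock at issue time (a one-row reservation table); the 4-operation example graph is hypothetical, not from the slides:

```python
# edges: dict (pred, succ) -> delay; uses: op -> resource kind;
# R: units available of each resource kind; priority: a prioritized
# topological order of the operations.
def list_schedule(edges, uses, R, priority):
    S = {}                                  # S[n] = issue clock of operation n
    RT = {}                                 # RT[(clock, resource)] = units in use
    for n in priority:
        # Earliest start given when the predecessors started.
        s = max([S[p] + d for (p, q), d in edges.items() if q == n],
                default=0)
        # Delay further until the needed resource unit is free.
        while RT.get((s, uses[n]), 0) + 1 > R:
            s += 1
        S[n] = s
        RT[(s, uses[n])] = RT.get((s, uses[n]), 0) + 1
    return S

# Two loads (resource 1 = memory) feeding two adds (resource 0 = ALU);
# loads have a 2-clock delay, and there is one unit of each resource.
edges = {(0, 2): 2, (1, 3): 2, (2, 3): 1}
S = list_schedule(edges, {0: 1, 1: 1, 2: 0, 3: 0}, R=1, priority=[0, 1, 2, 3])
```

The second load is pushed to clock 1 by the resource constraint, and each add waits out its predecessors' delays.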
Prioritized Topological Order
Possible prioritized orderings:
1) Critical path - the longest path through the data-dependence graph. The height of a node is the length of the longest path in the graph originating from that node.
2) Resource usage - the length of the schedule is constrained by the resources available. The critical resource is the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.
3) Source ordering - the operation that shows up earlier in the source program should be scheduled first.
Result of applying list scheduling to the data-dependence graph above, using height as the priority function (loads have a 2-clock delay):
Clock   ALU               Memory
  1                       LD R3, 8(R1)
  2                       LD R2, 0(R1)
  3     ADD R3, R3, R4
  4     ADD R3, R3, R2    ST 4(R1), R2
  5                       ST 12(R1), R3
  6                       ST 0(R7), R7
Global Code Scheduling
Strategies that consider more than one basic block at a time are referred to as global scheduling.
Conditions (the schedule must abide by the control and data dependences):
1. All instructions in the original program are executed in the optimized one, and
2. While the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.
Basic Block
A basic block is a sequence of instructions in which control enters the block through the first instruction and leaves the block via the last instruction, with no jump or branch in between (the flow is linear).
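Basic blocks can be identified with the classic leader rule: the first instruction, every branch target, and every instruction following a branch start a new block. A small illustrative Python version, with a made-up instruction encoding of my own:

```python
# Instructions are (op, branch_target_index_or_None) pairs.
def basic_blocks(code):
    leaders = {0}                             # the first instruction is a leader
    for i, (op, target) in enumerate(code):
        if op in ("goto", "if_goto"):
            leaders.add(target)               # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)            # so does the fall-through
    starts = sorted(leaders)
    return [list(range(s, e)) for s, e in zip(starts, starts[1:] + [len(code)])]

# 0: t = a   1: if_goto 4   2: c = b   3: goto 4   4: e = d + d
code = [("assign", None), ("if_goto", 4), ("assign", None),
        ("goto", 4), ("assign", None)]
```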
Primitive Code Motion
Source program:
if (a == 0) goto L
c = b
L: e = d + d
Locally scheduled machine code:
B1:       LD R6, 0(R1);  nop;  BEQZ R6, L
B2:       LD R7, 0(R2);  nop;  ST 0(R3), R7
B3 (L:):  LD R8, 0(R4);  nop;  ADD R8, R8, R8;  ST 0(R5), R8
Globally scheduled machine code:
B1:       LD R6, 0(R1);  LD R8, 0(R4);  LD R7, 0(R2)
          ADD R8, R8, R8;  BEQZ R6, L
B3' (fall-through, with c = b):  ST 0(R3), R7;  ST 0(R5), R8
B3 (L:):  ST 0(R5), R8
Upward Code Motion
It moves an operation from block src up a control-flow path to block dst, provided the move does not violate any data dependences and it makes the paths through dst and src run faster.
Case 1: src does not postdominate dst.
In this case there exists a path that passes through dst but does not reach src. This code motion is illegal unless the operation moved has no unwanted side effects.
Case 2: dst does not dominate src.
In this case there exists a path that reaches src without first going through dst. We need to move copies of the moved operation along such paths.
Constraints:
1. The operands of the operation must hold the same values as in the original,
2. the result does not overwrite a value that is still needed, and
3. it itself is not subsequently overwritten before reaching src.
Downward Code Motion
It moves an operation from block src down a control-flow path to block dst.
Case 1: src does not dominate dst - there exists a path to dst that does not pass through src.
Case 2: dst does not postdominate src - there exists a path through src that does not pass through dst.
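The dominance tests used in these cases can be computed with the standard iterative data-flow method (postdominance is the same computation on the reversed CFG). A small sketch; the CFG encoding is my own:

```python
# Iterative dominator computation: a block b is dominated by itself plus
# everything that dominates all of its predecessors. Block 0 is the entry.
def dominators(n_blocks, succ):
    preds = {b: [p for p in range(n_blocks) if b in succ[p]]
             for b in range(n_blocks)}
    dom = {b: set(range(n_blocks)) for b in range(n_blocks)}
    dom[0] = {0}
    changed = True
    while changed:
        changed = False
        for b in range(1, n_blocks):
            if preds[b]:
                new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            else:
                new = {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

# Diamond CFG: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}.
dom = dominators(4, {0: [1, 2], 1: [3], 2: [3], 3: []})
```

In the diamond, the join block 3 is dominated only by the entry, so moving an operation from 3 up into 1 would require a copy along the path through 2 (case 2 above).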
E.g.: if (x == 0) a = b; else a = c; d = a;
Memory assignments: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)
Locally scheduled machine code:
B1 (x == 0):    LD R1, x;  nop;  BEQZ R1, L
B2 (a = c):     LD R3, c;  nop;  ST a, R3
B3 (L:, a = b): LD R2, b;  nop;  ST a, R2
B4 (d = a):     LD R4, a;  nop;  ST d, R4
The same example, globally scheduled (the branch plus guarded store may alternatively be replaced by the predicated store in the comment):
B1:       LD R1, 0(R5);  LD R3, 0(R7);  LD R2, 0(R6)
          ST 0(R8), R3
          BEQZ R1, L   /* CMOVZ 0(R8), R2, R1 */
B2:       ST 0(R8), R2
B4 (L:):  LD R4, 0(R8);  nop;  ST 0(R9), R4
(x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9))
Updating Data Dependences
Code motion can change the data-dependence relations between operations, so the data dependences must be updated after each code motion.
E.g., consider two assignments x = 1 and x = 2 on different branches. If one assignment is moved up above the branch, the other cannot be; the motion is possible only because x is not live before the code motion.
Global Scheduling Algorithms: Region-Based Scheduling
The two easiest forms of code motion:
1. Moving operations up to control-equivalent basic blocks.
2. Moving operations speculatively up one branch to a dominating predecessor.
Assignment: the region-based scheduling algorithm.
Loop Unrolling
Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.
for (i = 0; i < N; i++)
{
    S(i);
}
can be unrolled as
for (i = 0; i + 4 < N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
}
for ( ; i < N; i++)
    S(i);    /* remainder loop for the leftover iterations */
Similarly,
repeat
    S;
until C;
can be unrolled as
repeat {
    S;
    if (C) break;
    S;
    if (C) break;
    S;
} until C;
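The transformation can be checked by comparing the unrolled loop (with its remainder loop) against the plain loop. A Python sketch with a hypothetical body S(i) that appends i*i to a list:

```python
# Straightforward loop: one body per trip.
def plain(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out

# 4x-unrolled loop plus a remainder loop for the last n % 4 iterations.
def unrolled(n):
    out = []
    i = 0
    while i + 4 <= n:                  # main unrolled loop, 4 bodies per trip
        out.append(i * i)
        out.append((i + 1) * (i + 1))
        out.append((i + 2) * (i + 2))
        out.append((i + 3) * (i + 3))
        i += 4
    while i < n:                       # remainder loop
        out.append(i * i)
        i += 1
    return out
```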
Neighborhood Compaction
Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks.
If such a pair is found, we check whether the instruction to be moved needs to be duplicated along other paths.
Advanced Code Motion Techniques
Adding new basic blocks along the control-flow edges originating from blocks with more than one predecessor.
Moving instructions out of basic blocks, so that a block can be eliminated completely.
The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithm only moves operations up to dominating blocks.
Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that
i) can be moved, and
ii) cannot be executed in their native block.
Interaction with Dynamic Schedulers
A dynamic scheduler can create new schedules according to run-time conditions.
High-latency instructions are issued early.
Data-prefetch instructions help the dynamic scheduler by making data available in advance.
Data-dependent operations are put in the correct order to ensure program correctness. For best performance, the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not.
Branch misprediction must be avoided.
Software Pipelining
Software Pipelining
Numerical applications often have loops whose iterations are completely independent of one another.
Such loops with many iterations have enough parallelism to saturate all the resources of a processor; it is up to the scheduler to take full advantage of the available parallelism.
Software pipelining schedules an entire loop at a time, to take full advantage of the parallelism across iterations.
Machine Model
The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.
The machine has a loop-back operation
BL R, L
which decrements register R and, unless the result is 0, branches to location L.
Machine Model
Memory operations have an auto-increment addressing mode, denoted by ++ after the register: the register is automatically incremented to point to the next consecutive address after each access.
The arithmetic operations are fully pipelined; they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.
Typical do-all loop
for ( i = 0; i< n; i++)
D[i] = A[i] * B[i] + c;
// R1, R2, R3 = &A, &B, &D
// R4 = c
// R10 = n - 1
L: LD R5, 0(R1++)
   LD R6, 0(R2++)
   MUL R7, R5, R6
   nop
   ADD R8, R7, R4
   nop
   ST 0(R3++), R8
   BL R10, L
Locally scheduled code
Five unrolled iterations of, e.g., for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;
Clock   j=1    j=2    j=3    j=4    j=5
  1     LD
  2     LD
  3     MUL    LD
  4            LD
  5            MUL    LD
  6     ADD           LD
  7                   MUL    LD
  8     ST     ADD           LD
  9                          MUL    LD
 10            ST     ADD           LD
 11                                 MUL
 12                   ST     ADD
 13
 14                          ST     ADD
 15
 16                                 ST
Clock   j=1    j=2    j=3    j=4
  1     LD
  2     LD
  3     MUL    LD
  4            LD
  5            MUL    LD
  6     ADD           LD
  7 L:                MUL    LD
  8     ST     ADD           LD     BL (L)
  9                          MUL
 10            ST     ADD
 11
 12                   ST     ADD
 13
 14                          ST
Software-Pipelined Code
A new iteration can be started on the pipeline every 2 clocks.
When the first iteration proceeds to stage three, the second iteration starts to execute.
By clock 7 the pipeline is fully filled with the first four iterations.
In the steady state, four consecutive iterations are executing at the same time.
The sequence of instructions in clocks 1 through 6 is called the prolog; clocks 7 and 8 are the steady state; clocks 9 through 14 are the epilog.
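The prolog/steady-state/epilog structure can be sanity-checked with simple clock accounting. In the schedule above, each iteration issues its LD, LD, MUL, ADD, ST at offsets 0, 1, 2, 5, 7 from its start, and a new iteration starts every 2 clocks (the initiation interval); the locally scheduled body takes 8 clocks per iteration. A Python sketch of that accounting (the offsets are read off the reconstructed tables above, so treat them as an assumption):

```python
OFFSETS = [0, 1, 2, 5, 7]   # issue offsets of LD, LD, MUL, ADD, ST per iteration
II = 2                      # initiation interval: a new iteration every 2 clocks

def pipelined_clocks(n):
    # Clock at which the last iteration's ST issues (iteration j starts
    # at clock II*(j-1) + 1, counting clocks from 1).
    last_start = II * (n - 1)
    return last_start + OFFSETS[-1] + 1

def sequential_clocks(n):
    return 8 * n            # locally scheduled body: 8 clocks per iteration
```

For 4 iterations this gives the 14-clock pipelined schedule shown above, versus 32 clocks sequentially.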