Instruction-Level Parallelism: compiler optimization techniques. Anna University, Dr. K. Thirunadana Sikamani. Prepared from Compilers by Aho and Ullman.
Compiler Optimization Techniques
CP 7031
Dr.K.Thirunadana Sikamani
Principal Sources of Optimization
Elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence that does the same thing, is usually called "code improvement" or "code optimization".
Redundancy
Semantics-preserving transformations
Global common subexpressions
Copy propagation
Dead-code elimination
Code motion
8/25/2014 Compiler Optimization Techniques - unit II
The speed of a program run on a processor with instruction-level parallelism depends on:
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.
Processor Architecture
1. Instruction Pipelines and Branch delays
2. Pipelined Execution
3. Multiple Instruction Issues –VLIW ( Very Long Instruction Word)
Code Scheduling Constraints
1. Control-dependence constraints
2. Data-dependence Constraints
3. Resource Constraints
Control-Dependence Constraints
All the operations executed in the original program must be executed in the optimized one.
Data-Dependence Constraints
The operations in the optimized program must produce the same results as the corresponding ones in the original program.
Resource Constraints
The schedule must not oversubscribe the resources of the machine.
Data Dependence
True dependence - Read after Write
Antidependence - Write after Read
Output dependence - Write after Write
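The three kinds of dependence above can be checked mechanically. A minimal Python sketch (the representation and names are my own, not from the slides): each statement is a pair of (variables read, variable written), and a later statement is compared against an earlier one.

```python
# Classify the dependence of a later statement t on an earlier statement s.
# Each statement is (set_of_variables_read, variable_written).
def classify(s, t):
    s_reads, s_writes = s
    t_reads, t_writes = t
    kinds = []
    if s_writes in t_reads:
        kinds.append("true (RAW)")      # t reads what s wrote
    if t_writes in s_reads:
        kinds.append("anti (WAR)")      # t overwrites what s read
    if t_writes == s_writes:
        kinds.append("output (WAW)")    # both write the same location
    return kinds

# The six statements from the exercise below: a=b, c=d, b=c, d=a, c=d, a=b
stmts = [({"b"}, "a"), ({"d"}, "c"), ({"c"}, "b"),
         ({"a"}, "d"), ({"d"}, "c"), ({"b"}, "a")]
```

Running `classify` on the exercise pairs reproduces the expected answers: statements 1 and 4 form a true dependence, 3 and 5 an antidependence, 1 and 6 an output dependence.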
Classify the dependences for the following statements:
1. a = b
2. c = d
3. b = c
4. d = a
5. c = d
6. a = b
Answers: statements 1 and 4 (true dependence), 3 and 5 (antidependence), 1 and 6 (output dependence).
Check the dependences for the following: give the register-level machine code that provides maximum parallelism, and also give the solution with minimal usage of registers, for the expression ((u+v) + (w+x)) + (y+z).
Minimal-register code:
LD r1, u
LD r2, v
ADD r1, r1, r2
LD r2, w
LD r3, x
ADD r2, r2, r3
ADD r1, r1, r2
LD r2, y
LD r3, z
ADD r2, r2, r3
ADD r1, r1, r2
Clock 1
LD r1, u
LD r2, v
LD r3, w
LD r4, x
LD r5, y
LD r6, z
Clock 2
ADD r1, r1, r2
ADD r3, r3, r4
ADD r5, r5, r6
Clock 3
ADD r1, r1, r3
Clock 4
ADD r1, r1, r5
Implementation of maximum parallelism in 4 clocks.
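The 4-clock result can be verified with a small clock-accounting sketch (a simplification of my own, assuming unlimited functional units and single-clock loads and adds, as in the schedule above):

```python
# Each value is tagged with the clock at which it becomes ready.
def ready_after_load(_name):
    return 1                        # all six loads issue together in clock 1

def add(t_left, t_right):
    # An add can issue only once both inputs are ready; it takes 1 clock.
    return max(t_left, t_right) + 1

u = v = w = x = y = z = ready_after_load("leaf")
t1 = add(u, v)        # ready after clock 2
t2 = add(w, x)        # ready after clock 2
t3 = add(y, z)        # ready after clock 2
t4 = add(t1, t2)      # ready after clock 3
total = add(t4, t3)   # ready after clock 4
```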
Finding Dependences Among Memory Accesses
1. Array data-dependence analysis
for (i = 0; i < n; i++)
    A[2*i] = A[2*i + 1];
2. Pointer alias analysis - two pointers are aliased if they refer to the same object.
3. Interprocedural analysis - determines whether the same variable is passed as two or more different arguments, in a language that passes parameters by reference.
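In the array loop above, the write A[2*i] touches only even indices and the read A[2*i+1] only odd ones, so no iteration touches a location another iteration uses. A quick illustrative Python check runs the loop in two different orders and compares the results:

```python
# The loop body A[2*i] = A[2*i+1] writes even slots and reads odd slots,
# so iterations are independent and any execution order gives the same array.
def run(order, a):
    a = list(a)
    for i in order:
        a[2 * i] = a[2 * i + 1]
    return a

n = 8
init = list(range(2 * n))
forward = run(range(n), init)
backward = run(reversed(range(n)), init)
```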
Tradeoff Between Register Usage and Parallelism
E.g., machine-independent intermediate-representation code:
LD t1, a
ST b, t1
LD t2, c
ST d, t2
The code above copies the values of a and c to b and d. If all memory locations are distinct, the two copies can proceed in parallel. If, instead, t1 and t2 are assigned the same register to minimize register usage, the second copy cannot start until the first has finished.
Tradeoff Between Register Usage and Parallelism
The syntax tree for (a + b) + c + (d + e):
[Figure: syntax tree of + nodes over the leaves a, b, c, d, e]
Machine code (minimal registers):
LD r1, a
LD r2, b
ADD r1, r1, r2
LD r2, c
ADD r1, r1, r2
LD r2, d
LD r3, e
ADD r2, r2, r3
ADD r1, r1, r2
Parallel evaluation of the expression:
Clock 1: r1 = a   r2 = b   r3 = c   r4 = d   r5 = e
Clock 2: r6 = r1 + r2   r7 = r4 + r5
Clock 3: r8 = r6 + r3
Clock 4: r9 = r8 + r7
Phase Ordering Between Register Allocation and Code Scheduling
If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling.
The other way around, the schedule created may require so many registers that register spilling becomes necessary.
Spilling - storing the contents of a register in a memory location, so the register can be used for some other purpose.
Which ordering is better depends on the characteristics of the program, e.g., numeric, non-numeric, etc.
Control Dependence
if (c) s1; else s2;   /* s1 and s2 are control dependent on c */
while (c) s;          /* s is control dependent on c */
if (a > t)
    b = a * a;
d = a + c;            /* d = a + c is not control dependent on the test */
Speculative Execution Support
Prefetching - bringing data from memory to the cache before it is used.
Poison bits - support speculative loads of data from memory into the register file. Each register is augmented with a poison bit; the bit is set when an illegal memory address is accessed, so that an exception is raised only at a later use of the value.
Predicated Execution
Predicated instructions were invented to reduce the number of branches in a program.
A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution.
E.g., CMOVZ R2, R3, R1 has the semantics of moving the contents of R3 to R2 if R1 is zero.
if (a == 0) b = c + d; can be implemented as
ADD R3, R4, R5     /* a, b, c, d are allotted R1, R2, R4, R5 */
CMOVZ R2, R3, R1
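A sketch of this if-conversion in Python, modeling registers as a dictionary; the register assignment follows the slide, while the function names are hypothetical:

```python
# CMOVZ semantics: move src into dst if the predicate register is zero.
def cmovz(regs, dst, src, pred):
    if regs[pred] == 0:
        regs[dst] = regs[src]

# if (a == 0) b = c + d; with a, b, c, d allotted R1, R2, R4, R5.
def if_conversion(a, b, c, d):
    regs = {"R1": a, "R2": b, "R4": c, "R5": d}
    regs["R3"] = regs["R4"] + regs["R5"]   # ADD R3, R4, R5 (always executes)
    cmovz(regs, "R2", "R3", "R1")          # CMOVZ R2, R3, R1
    return regs["R2"]                      # the final value of b
```

The ADD executes unconditionally, so there is no branch; only the move into b is guarded by the predicate.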
Basic Machine Model
Many machines can be represented as
M = <R, T>
T - a set of operation types, such as loads, stores, and arithmetic operations.
R - a vector R = [r1, r2, ...] of hardware resources;
ri is the number of units available of the ith kind of resource.
Resources: memory-access units, ALUs, floating-point functional units.
Basic Machine Model
Each operation has a set of input operands, a set of output operands, and a resource requirement.
RTt - the resource-reservation table for operation type t.
RTt[i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
Basic-Block Scheduling
Data-Dependence Graphs
Graph G = (N, E)
N - a set of nodes representing the operations in the machine instructions.
E - a set of directed edges representing the data-dependence constraints among the operations.
1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.
2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.
Data-Dependence Graph
i1: LD R2, 0(R1)
i2: ST 4(R1), R2
i3: ADD R3, R3, R2
i4: ADD R3, R3, R4
i5: LD R3, 8(R1)
i6: ST 0(R7), R7
i7: ST 12(R1), R3
[Figure: the data-dependence graph over i1-i7; edges out of loads carry a delay of 2, all other edges a delay of 1.]
1. A load operation takes 2 clock cycles.
2. R1 is a stack pointer, with offsets from 0 to 12.
List Scheduling of Basic Blocks
This involves visiting each node of the data-dependence graph in "prioritized topological order".
Machine-resource vector R = [r1, r2, r3, ...]
ri - the number of units available of the ith kind of resource
G = (N, E) - the data-dependence graph
RTn - the resource-reservation table of node n
An edge e = n1 -> n2 with delay de indicates that n2 may be issued only de clocks after n1 is issued.
List Scheduling Algorithm
RT = an empty reservation table;
for (each n in N in prioritized topological order) {
    s = max over edges e = p -> n in E of (S(p) + de);
        /* the earliest time this instruction could begin,
           given when its predecessors started */
    while (there exists i such that RT[s + i] + RTn[i] > R)
        s = s + 1;
        /* delay the instruction further until the needed
           resources are available */
    S(n) = s;
    for (all i)
        RT[s + i] = RT[s + i] + RTn[i];
}
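A minimal Python rendering of the algorithm above, simplified so that each operation occupies one unit of one resource kind for a single clock at issue time (a one-row reservation table); the 4-operation example graph is hypothetical, not from the slides:

```python
# edges: dict (pred, succ) -> delay; uses: op -> resource kind;
# R: units available of each resource kind; priority: a prioritized
# topological order of the operations.
def list_schedule(edges, uses, R, priority):
    S = {}                                  # S[n] = issue clock of operation n
    RT = {}                                 # RT[(clock, resource)] = units in use
    for n in priority:
        # Earliest start given when the predecessors started.
        s = max([S[p] + d for (p, q), d in edges.items() if q == n],
                default=0)
        # Delay further until the needed resource unit is free.
        while RT.get((s, uses[n]), 0) + 1 > R:
            s += 1
        S[n] = s
        RT[(s, uses[n])] = RT.get((s, uses[n]), 0) + 1
    return S

# Two loads (resource 1 = memory) feeding two adds (resource 0 = ALU);
# loads have a 2-clock delay, and there is one unit of each resource.
edges = {(0, 2): 2, (1, 3): 2, (2, 3): 1}
S = list_schedule(edges, {0: 1, 1: 1, 2: 0, 3: 0}, R=1, priority=[0, 1, 2, 3])
```

The second load is pushed to clock 1 by the resource constraint, and each add waits out its predecessors' delays.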
Prioritized Topological Order
Possible prioritized orderings:
1) Critical path - the longest path through the data-dependence graph. The height of a node is the length of the longest path in the graph originating from that node.
2) Resource usage - the length of the schedule is constrained by the resources available. The critical resource is the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.
3) Source ordering - the operation that shows up earlier in the source program should be scheduled first.
Result of applying list scheduling to the data-dependence graph above, using height as the priority function (loads have a 2-clock delay):
Clock   ALU               Memory
  1                       LD R3, 8(R1)
  2                       LD R2, 0(R1)
  3     ADD R3, R3, R4
  4     ADD R3, R3, R2    ST 4(R1), R2
  5                       ST 12(R1), R3
  6                       ST 0(R7), R7
Global Code Scheduling
Strategies that consider more than one basic block at a time are referred to as global scheduling.
Conditions (the schedule must abide by the control and data dependences):
1. All instructions in the original program are executed in the optimized one, and
2. While the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.
Basic Block
A basic block is a sequence of instructions in which control enters the block through the first instruction and leaves the block via the last instruction, with no jump or branch in between (the flow is linear).
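Basic blocks can be identified with the classic leader rule: the first instruction, every branch target, and every instruction following a branch start a new block. A small illustrative Python version, with a made-up instruction encoding of my own:

```python
# Instructions are (op, branch_target_index_or_None) pairs.
def basic_blocks(code):
    leaders = {0}                             # the first instruction is a leader
    for i, (op, target) in enumerate(code):
        if op in ("goto", "if_goto"):
            leaders.add(target)               # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)            # so does the fall-through
    starts = sorted(leaders)
    return [list(range(s, e)) for s, e in zip(starts, starts[1:] + [len(code)])]

# 0: t = a   1: if_goto 4   2: c = b   3: goto 4   4: e = d + d
code = [("assign", None), ("if_goto", 4), ("assign", None),
        ("goto", 4), ("assign", None)]
```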
Primitive Code Motion
Source program:
if (a == 0) goto L
c = b
L: e = d + d
Locally scheduled machine code:
B1:       LD R6, 0(R1);  nop;  BEQZ R6, L
B2:       LD R7, 0(R2);  nop;  ST 0(R3), R7
B3 (L:):  LD R8, 0(R4);  nop;  ADD R8, R8, R8;  ST 0(R5), R8
Globally scheduled machine code:
B1:       LD R6, 0(R1);  LD R8, 0(R4);  LD R7, 0(R2)
          ADD R8, R8, R8;  BEQZ R6, L
B3' (fall-through, with c = b):  ST 0(R3), R7;  ST 0(R5), R8
B3 (L:):  ST 0(R5), R8
Upward Code Motion
It moves an operation from block src up a control-flow path to block dst, provided the move does not violate any data dependences and it makes the paths through dst and src run faster.
Case 1: src does not postdominate dst.
In this case there exists a path that passes through dst but does not reach src. This code motion is illegal unless the operation moved has no unwanted side effects.
Case 2: dst does not dominate src.
In this case there exists a path that reaches src without first going through dst. We need to move copies of the moved operation along such paths.
Constraints:
1. The operands of the operation must hold the same values as in the original,
2. the result does not overwrite a value that is still needed, and
3. it itself is not subsequently overwritten before reaching src.
Downward Code Motion
It moves an operation from block src down a control-flow path to block dst.
Case 1: src does not dominate dst - there exists a path to dst that does not pass through src.
Case 2: dst does not postdominate src - there exists a path through src that does not pass through dst.
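The dominance tests used in these cases can be computed with the standard iterative data-flow method (postdominance is the same computation on the reversed CFG). A small sketch; the CFG encoding is my own:

```python
# Iterative dominator computation: a block b is dominated by itself plus
# everything that dominates all of its predecessors. Block 0 is the entry.
def dominators(n_blocks, succ):
    preds = {b: [p for p in range(n_blocks) if b in succ[p]]
             for b in range(n_blocks)}
    dom = {b: set(range(n_blocks)) for b in range(n_blocks)}
    dom[0] = {0}
    changed = True
    while changed:
        changed = False
        for b in range(1, n_blocks):
            if preds[b]:
                new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            else:
                new = {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

# Diamond CFG: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}.
dom = dominators(4, {0: [1, 2], 1: [3], 2: [3], 3: []})
```

In the diamond, the join block 3 is dominated only by the entry, so moving an operation from 3 up into 1 would require a copy along the path through 2 (case 2 above).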
E.g.: if (x == 0) a = b; else a = c; d = a;
Memory assignments: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)
Locally scheduled machine code:
B1 (x == 0):    LD R1, x;  nop;  BEQZ R1, L
B2 (a = c):     LD R3, c;  nop;  ST a, R3
B3 (L:, a = b): LD R2, b;  nop;  ST a, R2
B4 (d = a):     LD R4, a;  nop;  ST d, R4
The same example, globally scheduled (the branch plus guarded store may alternatively be replaced by the predicated store in the comment):
B1:       LD R1, 0(R5);  LD R3, 0(R7);  LD R2, 0(R6)
          ST 0(R8), R3
          BEQZ R1, L   /* CMOVZ 0(R8), R2, R1 */
B2:       ST 0(R8), R2
B4 (L:):  LD R4, 0(R8);  nop;  ST 0(R9), R4
(x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9))
Updating Data Dependences
Code motion can change the data-dependence relations between operations, so the data dependences must be updated after each code motion.
E.g., consider two assignments x = 1 and x = 2 on different branches. If one assignment is moved up above the branch, the other cannot be; the motion is possible only because x is not live before the code motion.
Global Scheduling Algorithms: Region-Based Scheduling
The two easiest forms of code motion:
1. Moving operations up to control-equivalent basic blocks.
2. Moving operations speculatively up one branch to a dominating predecessor.
Assignment: the region-based scheduling algorithm.
Loop Unrolling
Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.
for (i = 0; i < N; i++)
{
    S(i);
}
can be unrolled as
for (i = 0; i + 4 < N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
}
for ( ; i < N; i++)
    S(i);    /* remainder loop for the leftover iterations */
Similarly,
repeat
    S;
until C;
can be unrolled as
repeat {
    S;
    if (C) break;
    S;
    if (C) break;
    S;
} until C;
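The transformation can be checked by comparing the unrolled loop (with its remainder loop) against the plain loop. A Python sketch with a hypothetical body S(i) that appends i*i to a list:

```python
# Straightforward loop: one body per trip.
def plain(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out

# 4x-unrolled loop plus a remainder loop for the last n % 4 iterations.
def unrolled(n):
    out = []
    i = 0
    while i + 4 <= n:                  # main unrolled loop, 4 bodies per trip
        out.append(i * i)
        out.append((i + 1) * (i + 1))
        out.append((i + 2) * (i + 2))
        out.append((i + 3) * (i + 3))
        i += 4
    while i < n:                       # remainder loop
        out.append(i * i)
        i += 1
    return out
```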
Neighborhood Compaction
Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks.
If such a pair is found, we check whether the instruction to be moved needs to be duplicated along other paths.
Advanced Code Motion Techniques
Adding new basic blocks along the control-flow edges originating from blocks with more than one predecessor.
Moving instructions out of basic blocks, so that a block can be eliminated completely.
The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithm only moves operations up to dominating blocks.
Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that
i) can be moved, and
ii) cannot be executed in their native block.
Interaction with Dynamic Schedulers
A dynamic scheduler can create new schedules according to run-time conditions.
High-latency instructions are issued early.
Data-prefetch instructions help the dynamic scheduler by making data available in advance.
Data-dependent operations are put in the correct order to ensure program correctness. For best performance, the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not.
Branch misprediction must be avoided.
Software Pipelining
Software Pipelining
Numerical applications often have loops whose iterations are completely independent of one another.
Such loops with many iterations have enough parallelism to saturate all the resources of a processor; it is up to the scheduler to take full advantage of the available parallelism.
Software pipelining schedules an entire loop at a time, to take full advantage of the parallelism across iterations.
Machine Model
The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.
The machine has a loop-back operation
BL R, L
which decrements register R and, unless the result is 0, branches to location L.
Machine Model
Memory operations have an auto-increment addressing mode, denoted by ++ after the register: the register is automatically incremented to point to the next consecutive address after each access.
The arithmetic operations are fully pipelined; they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.
Typical do-all loop
for ( i = 0; i< n; i++)
D[i] = A[i] * B[i] + c;
// R1, R2, R3 = &A, &B, &D
// R4 = c
// R10 = n - 1
L: LD R5, 0(R1++)
   LD R6, 0(R2++)
   MUL R7, R5, R6
   nop
   ADD R8, R7, R4
   nop
   ST 0(R3++), R8
   BL R10, L
Locally scheduled code
Five unrolled iterations of, e.g., for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;
Clock   j=1    j=2    j=3    j=4    j=5
  1     LD
  2     LD
  3     MUL    LD
  4            LD
  5            MUL    LD
  6     ADD           LD
  7                   MUL    LD
  8     ST     ADD           LD
  9                          MUL    LD
 10            ST     ADD           LD
 11                                 MUL
 12                   ST     ADD
 13
 14                          ST     ADD
 15
 16                                 ST
Clock   j=1    j=2    j=3    j=4
  1     LD
  2     LD
  3     MUL    LD
  4            LD
  5            MUL    LD
  6     ADD           LD
  7 L:                MUL    LD
  8     ST     ADD           LD     BL (L)
  9                          MUL
 10            ST     ADD
 11
 12                   ST     ADD
 13
 14                          ST
Software-Pipelined Code
A new iteration can be started on the pipeline every 2 clocks.
When the first iteration proceeds to stage three, the second iteration starts to execute.
By clock 7 the pipeline is fully filled with the first four iterations.
In the steady state, four consecutive iterations are executing at the same time.
The sequence of instructions in clocks 1 through 6 is called the prolog; clocks 7 and 8 are the steady state; clocks 9 through 14 are the epilog.
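The prolog/steady-state/epilog structure can be sanity-checked with simple clock accounting. In the schedule above, each iteration issues its LD, LD, MUL, ADD, ST at offsets 0, 1, 2, 5, 7 from its start, and a new iteration starts every 2 clocks (the initiation interval); the locally scheduled body takes 8 clocks per iteration. A Python sketch of that accounting (the offsets are read off the reconstructed tables above, so treat them as an assumption):

```python
OFFSETS = [0, 1, 2, 5, 7]   # issue offsets of LD, LD, MUL, ADD, ST per iteration
II = 2                      # initiation interval: a new iteration every 2 clocks

def pipelined_clocks(n):
    # Clock at which the last iteration's ST issues (iteration j starts
    # at clock II*(j-1) + 1, counting clocks from 1).
    last_start = II * (n - 1)
    return last_start + OFFSETS[-1] + 1

def sequential_clocks(n):
    return 8 * n            # locally scheduled body: 8 clocks per iteration
```

For 4 iterations this gives the 14-clock pipelined schedule shown above, versus 32 clocks sequentially.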