Embedded Systems in Silicon (TD5102)
Compilers, with emphasis on ILP compilation
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore
2005/2006
Compiling for ILP Architectures
Overview:
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Summary and Conclusions
Motivation
• Performance requirements increase
• Applications may contain much instruction level parallelism
• Processors offer lots of hardware concurrency

Problem to be solved:
– How to exploit this concurrency automatically?
Goals of code generation
• High speedup
– Exploit all the hardware concurrency
– Extract all application parallelism
  • obey true dependences only
  • resolve false dependences by renaming
• No code rewriting: automatic parallelization
– However: application tuning may be required
• Limit code expansion
Overview
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Summary and Conclusions
Measuring and exploiting available parallelism
• How to measure parallelism within applications?
– Using an existing compiler
– Using trace analysis
• Track all real data dependences (RaW) of instructions from the issue window
– register dependences
– memory dependences
• Check for correct branch prediction
– if the prediction is correct: continue
– if wrong: flush the schedule and start in the next cycle
Trace analysis
Program
for i := 0..2
  A[i] := i;
S := X+3;
Compiled code
set r1,0
set r2,3
set r3,&A
Loop: st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3
Execution trace
set r1,0
set r2,3
set r3,&A
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3

How parallel can this code be executed?
Trace analysis
Parallel Trace
set r1,0 set r2,3 set r3,&A
st r1,0(r3) add r1,r1,1 add r3,r3,4
st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop
st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop
brne r1,r2,Loop
add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 ≈ 2.7
Ideal Processor

Assumptions for an ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and Jump prediction – Perfect => all program instructions available for execution
3. Memory-address alias analysis – addresses are known. A store can be moved before a load provided addresses not equal
Also:
– unlimited number of instructions issued per cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1-cycle latency for all instructions (including FP * and /)

Programs were compiled using the MIPS compiler with maximum optimization level.
Upper Limit to ILP: Ideal Processor
(Figure: instruction issues per cycle (IPC) on the ideal processor, per benchmark: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1.)

Integer: 18 - 60   FP: 75 - 150
Different effects reduce the exploitable parallelism
• Reducing window size
– i.e., the number of instructions to choose from
• Non-perfect branch prediction
– perfect (oracle model)
– dynamic predictor (e.g. a 2-bit prediction table with a finite number of entries)
– static prediction (using profiling)
– no prediction
• Restricted number of registers for renaming
– typical superscalars have O(100) registers
• Restricted number of other resources, like FUs
• Non-perfect alias analysis (memory disambiguation); models to use:
– perfect
– inspection: no dependence in the following cases:

    r1 := 0(r9)        r1 := 0(fp)
    4(r9) := r2        0(gp) := r2

  A more advanced analysis may disambiguate most stack and global references, but not the heap references
– none

• Important:
– good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and, for floating point, a large window size
Summary
• Amount of parallelism is limited
– higher in multimedia
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of the compiler
– hardware
– source-code transformations
Overview
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
Compiler basics
• Overview
– Compiler trajectory / structure / passes
– Abstract Syntax Tree (AST)
– Control Flow Graph (CFG)
– Data Dependence Graph (DDG)
– Basic optimizations
– Register allocation
– Code selection
Compiler basics: trajectory
Source program
  ↓
Preprocessor
  ↓
Compiler          (emits error messages)
  ↓
Assembler
  ↓
Loader/Linker     (adds library code)
  ↓
Object program
Compiler basics: structure / passes
Source code
  ↓
Lexical analyzer           – token generation
  ↓
Parsing                    – check syntax, check semantics, parse tree generation
  ↓  (intermediate code)
Code optimization          – data flow analysis, local optimizations, global optimizations
  ↓
Code generation            – code selection, peephole optimizations
  ↓  (sequential code)
Register allocation        – making the interference graph, graph coloring, spill code insertion, caller/callee save and restore code
  ↓
Scheduling and allocation  – exploiting ILP
  ↓
Object code
Compiler basics: structure – simple compilation example

Input:
  position := initial + rate * 60

Lexical analyzer:
  id1 := id2 + id3 * 60

Syntax analyzer (parse tree):
  :=
  ├ id1
  └ +
    ├ id2
    └ *
      ├ id3
      └ 60

Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1

Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
Compiler basics: structure - SUIF-1 toolkit example
Front-end:
• FORTRAN input: FORTRAN to C conversion, FORTRAN-specific transformations
• C input: pre-processing, C front-end
• converting non-standard structures to SUIF

High-SUIF passes:
• constant propagation
• forward propagation
• induction variable identification
• scalar privatization analysis
• reduction analysis
• locality optimization and parallelism analysis
• parallel code generation

After high-SUIF to low-SUIF conversion:
• constant propagation
• strength reduction
• dead-code elimination
• register allocation
• assembly code generation → assembly code

Output utilities: SUIF to text, SUIF to postscript, SUIF to C.
Compiler basics: Abstract Syntax Tree (AST)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }
Parse tree (‘infinite’ nesting possible):

Stat: IF
├─ Cmp: > (Var a, Var b)
├─ Statlist: Stat: Expr: Assign(Var r, Binop %(Var a, Var b))
└─ Statlist: Stat: Expr: Assign(Var r, Binop %(Var b, Var a))
Compiler basics: Control flow graph (CFG)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }

CFG:
BB1: sub t1, a, b
     bgz t1, 2, 3
BB2: rem r, a, b
     goto 4
BB3: rem r, b, a
     goto 4
BB4: .....

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports, ...
Data Dependence Graph (DDG)

Example code:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;

Translation to a DDG:
• ld &b and the constant 15 feed the + node; its result feeds st &a
• ld &d and the constant 3.14 feed the * node; its result feeds st &c and the / node
• the * result (the value of c) and ld &f feed the / node; its result feeds st &e
Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations
(details can be found in any good compiler book)
Machine independent optimizations
– Common subexpression elimination
– Constant folding
– Copy propagation
– Dead-code elimination
– Induction variable elimination
– Strength reduction
– Algebraic identities
  • Commutative expressions
  • Associativity: tree height reduction
– Note: not always allowed (due to limited precision)
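A small before/after C sketch (my own illustration; names and values are invented) in which constant folding, common subexpression elimination, copy propagation and dead-code elimination all apply:

  /* Before optimization (illustrative): */
  int before(int n) {
      int x = 4 * 8;       /* constant folding: x = 32              */
      int a = n * x + n;
      int b = n * x + n;   /* common subexpression: same value as a */
      int c = b;           /* copy propagation: uses of c become b  */
      int dead = a - b;    /* dead code: result is never used       */
      return c + a;
  }

  /* What the optimizer conceptually produces: */
  int after(int n) {
      int a = n * 32 + n;  /* folded constant, one shared computation */
      return a + a;        /* b == a, c == a, dead code removed       */
  }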
Machine dependent optimization example

What’s the optimal implementation of a*34?
– Use a multiplier: mul Tb,Ta,34
  • Pro: no thinking required
  • Con: may take many cycles
– Alternative:
    SHL Tc, Ta, 1
    ADD Tb, Tc, Tzero
    SHL Tc, Tc, 4
    ADD Tb, Tb, Tc
  • Pro: may take fewer cycles
  • Cons:
    • uses more registers
    • additional instructions (I-cache load / code size)
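A C rendering of the shift-add alternative (my sketch; adding Tzero corresponds to adding 0), showing that the sequence computes 2*a + 32*a = 34*a:

  #include <assert.h>

  unsigned mul34(unsigned a) {
      unsigned tc = a << 1;   /* SHL Tc, Ta, 1    : 2*a  */
      unsigned tb = tc + 0;   /* ADD Tb, Tc, Tzero       */
      tc = tc << 4;           /* SHL Tc, Tc, 4    : 32*a */
      tb = tb + tc;           /* ADD Tb, Tb, Tc   : 34*a */
      return tb;
  }

  int main(void) {
      for (unsigned a = 0; a < 1000; a++)
          assert(mul34(a) == 34 * a);   /* sanity check */
      return 0;
  }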
Compiler basics: Register allocation

• Register organization: conventions are needed for parameter passing and register usage across function calls; a MIPS example:

  r31 – r21 : callee-saved registers
  r20 – r11 : caller-saved registers (temporaries)
  r10 – r1  : argument and result transfer
  r0        : hard-wired 0
Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
• A variable is defined at a point in program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between definitions and uses of a variable.
Register allocation using graph coloring

Example program:
  a :=
  c :=
  b :=
     := b
  d :=
     := a
     := c
     := d

(Figure: live ranges of a, b, c and d, each running from its definition to its last use.)
Register allocation using graph coloring

Interference graph (an edge connects two variables whose live ranges overlap):
  a–b, a–c, a–d, b–c, c–d

Coloring:
  a = red
  b = green
  c = blue
  d = green

The graph needs 3 colors (chromatic number = 3) => the program needs 3 registers.
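A minimal C sketch of the coloring step (greedy, in program order; the 4-node interference graph above is hardcoded as an adjacency matrix, whereas a real allocator would build it from the live ranges):

  #include <stdio.h>

  int main(void) {
      const char *name = "abcd";
      /* adj[i][j] = 1 if the live ranges of i and j overlap
         (edges a-b, a-c, a-d, b-c, c-d from the example) */
      int adj[4][4] = {
          { 0, 1, 1, 1 },   /* a */
          { 1, 0, 1, 0 },   /* b */
          { 1, 1, 0, 1 },   /* c */
          { 1, 0, 1, 0 },   /* d */
      };
      int color[4];
      for (int v = 0; v < 4; v++) {
          int used = 0;                  /* bitmask of colors of earlier neighbors */
          for (int u = 0; u < v; u++)
              if (adj[v][u]) used |= 1 << color[u];
          int c = 0;
          while (used & (1 << c)) c++;   /* first free color = register number */
          color[v] = c;
          printf("%c -> r%d\n", name[v], c);
      }
      return 0;
  }

This prints a -> r0, b -> r1, c -> r2, d -> r1: three registers suffice, matching the coloring above.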
Register allocation using graph coloring: spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

Program:
  a :=
  c :=
  store c
  b :=
     := b
  d :=
     := a
  load c
     := c
     := d

(Figure: live ranges of a, b, c and d; spilling c splits its live range so that no more than two variables are live simultaneously.)
Compiler basics: Code selection

• CISC era
– Code size important
– Determine the shortest code sequence
  • Many options may exist
– Pattern matching; example M68020:
    D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
    => ADD ([10,A1], D2*16, 20), D1
• RISC era
– Performance important
– Only few possible code sequences
– New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020
Overview
• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions
What is scheduling?
• Time allocation:
– Assigning instructions or operations to time slots
– Preserve dependences:
  • Register dependences
  • Memory dependences
– Optimize code with respect to performance / code size / power consumption / ...
• Space allocation – satisfy resource constraints:
– Bind operations to FUs
– Bind variables to registers / register files
– Bind transports to buses
Why scheduling?

Let’s look at the execution time:

  T_execution = N_cycles x T_cycle
              = N_instructions x CPI x T_cycle

Scheduling may reduce T_execution:
– Reduce CPI (cycles per instruction)
  • early scheduling of long-latency operations
  • avoid pipeline stalls due to structural, data and control hazards
  • allow N_issue > 1 and therefore CPI < 1
– Reduce N_instructions
  • compact many operations into each instruction (VLIW)
Scheduling data hazards: RaW dependences

Avoiding RaW stalls: reordering of instructions by the compiler.

Example: avoiding the one-cycle load interlock.

Code:
  a = b + c
  d = e - f

Unscheduled code:
  Lw  R1,b
  Lw  R2,c
  Add R3,R1,R2    <- interlock
  Sw  a,R3
  Lw  R1,e
  Lw  R2,f
  Sub R4,R1,R2    <- interlock
  Sw  d,R4

Scheduled code:
  Lw  R1,b
  Lw  R2,c
  Lw  R5,e        <- extra register needed!
  Add R3,R1,R2
  Lw  R2,f
  Sw  a,R3
  Sub R4,R5,R2
  Sw  d,R4
Scheduling control hazards

A branch requires 3 actions:
• Compute the new address
• Determine the condition
• Perform the actual branch (if taken): PC := new address

(Figure: pipeline diagram (IF ID OF EX WB stages) of a branch to L under the predict-not-taken scheme; instructions fetched after a taken branch are squashed and execution restarts at L.)
Control hazards: what's the penalty?

  CPI = CPI_ideal + f_branch x P_branch
  P_branch = N_delayslots x miss_rate

• Superscalars tend to have a large branch penalty P_branch due to
– many pipeline stages
– multiple instructions (or operations) per cycle
• Note:
– the lower the CPI, the larger the relative effect of branch penalties
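An illustrative calculation (numbers invented for the example): with CPI_ideal = 1, f_branch = 0.2, N_delayslots = 3 and miss_rate = 0.1, we get P_branch = 3 x 0.1 = 0.3 and CPI = 1 + 0.2 x 0.3 = 1.06, a 6% slowdown. With CPI_ideal = 0.5 (a 2-issue machine) the same penalty gives CPI = 0.56, a 12% slowdown, which is exactly the effect noted above.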
What can we do about control hazards and the CPI penalty?
• Keep the penalty P_branch low:
– Early computation of the new PC
– Early determination of the condition
– Visible delay slots, filled by the compiler (MIPS)
• Branch prediction
• Reduce control dependences (control height reduction) [Schlansker and Kathail, Micro’95]
• Remove branches: if-conversion
– Conditional instructions: CMOVE, conditional skip-next
– Guarding all instructions: TriMedia
Scheduling: Conditional instructions

• Example: CMOVE (supported by Alpha)

  if (A == 0) S = T;

assume: r1: A, r2: S, r3: T

Object code:
      Bnez r1, L
      Mov  r2, r3
L:    ....

After conversion:
      Cmovz r2, r3, r1
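In C terms (my illustrative rendering), if-conversion turns the control dependence into a data-dependent select; both functions below compute the same result, but the second maps to a single conditional move instead of a branch:

  int branch_style(int A, int S, int T) {
      if (A == 0) S = T;        /* Bnez r1, L ; Mov r2, r3 ; L: */
      return S;
  }

  int cmove_style(int A, int S, int T) {
      return (A == 0) ? T : S;  /* Cmovz r2, r3, r1 */
  }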
Scheduling: Conditional instructions

Conditional instructions are useful, however:
• Squashed instructions still take execution time and execution resources
– Consequence: long target blocks cannot be if-converted
• The condition has to be known early
• Moving operations across multiple branches requires complicated predicates
• Compatibility: change of ISA (instruction set architecture)

Practice:
• Current superscalars support a limited set of conditional instructions
– CMOVE: Alpha, MIPS, PowerPC, SPARC
– HP PA: any RR instruction can conditionally squash the next instruction
• Large VLIWs profit from making all instructions conditional
– guarded execution: TriMedia, Intel/HP IA-64, TI C6x
Guarded execution: IF-conversion

Before:
      SLT  r1,r2,r3
      BEQ  r1,r0, else
then: ADDI r2,r2,1
      ..X..
      j    cont
else: SUBI r2,r2,1
      ..Y..
cont: MUL  r4,r2

After IF-conversion:
      SLT   b1,r2,r3
  b1: ADDI  r2,r2,1
 !b1: SUBI  r2,r2,1
  b1: ..X..
 !b1: ..Y..
      MUL   r4,r2
Scheduling: Conditional instructions

Full guard support: if-conversion of conditional code.

Assume:
• t_branch : branch latency
• p_branch : branching probability
• t_true   : execution time of the TRUE branch
• t_false  : execution time of the FALSE branch

Execution times of the original and the if-converted code for a non-ILP architecture:

  t_original_code     = (1 + p_branch) x t_branch + p_branch x t_true + (1 - p_branch) x t_false
  t_if_converted_code = t_true + t_false
Scheduling: Conditional instructions

(Figure: speedup of if-converted code for non-ILP architectures.)

Only interesting for short target blocks!
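Plugging illustrative numbers into the formulas above: with t_branch = 2, p_branch = 0.5 and t_true = t_false = 4, t_original = 1.5 x 2 + 2 + 2 = 7 while t_if_converted = 8, so if-conversion loses. Halving the blocks to t_true = t_false = 2 gives t_original = 5 versus t_if_converted = 4, a 1.25x speedup.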
Scheduling: Conditional instructions

Speedup of if-converted code for ILP architectures with sufficient resources:

  t_if_converted = max(t_true, t_false)

Much larger area of interest!
Scheduling: Conditional instructions

• Full guard support for large ILP architectures has a number of advantages:
– Removing unpredictable branches
– Enlarging the scheduling scope
– Enabling software pipelining
– Enhancing code motion when speculation is not allowed
– Resource sharing; even when speculation is allowed, guarding may be profitable
Scheduling: Overview

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
      perform function inlining
  for each procedure do
      transform an irreducible CFG into a reducible CFG
      perform control flow analysis
      perform loop unrolling
      perform data flow analysis
      perform memory reference disambiguation
      perform register allocation
      for each scheduling scope do
          perform instruction scheduling
  write parallel program
Scheduling: Integer Linear Programming

Integer linear programming scheduling method
• Introduce:
– Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j
– Constraints, such as limited resources:

    for every cycle j and type t:   Σ_{i of type t} x_{i,j} ≤ M_t

  where M_t is the number of resources of type t
– Data dependence constraints
– Timing constraints
• Problem: too many decision variables
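Written out (a standard formulation; the notation here is mine, not from the slides):

  \begin{align*}
  \sum_{j} x_{i,j} &= 1 && \forall i && \text{(each operation scheduled exactly once)}\\
  \sum_{i:\ \mathrm{type}(i)=t} x_{i,j} &\le M_t && \forall j, t && \text{(resource constraints)}\\
  \sum_{j} j\,x_{v,j} &\ge \sum_{j} j\,x_{u,j} + \mathrm{delay}(u,v) && \forall (u,v)\in E && \text{(data dependence constraints)}
  \end{align*}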
List Scheduling

• Make a dependence graph
• Determine the minimal schedule length
• Determine the ASAP, ALAP, and slack of each operation
• Place each operation in the first cycle with sufficient resources

Note:
– The scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
Basic Block Scheduling

(Figure: an example DDG over values A, B, C, X, y, z with operations LD, ADD, SUB, NEG and MUL; each operation is annotated with its <ASAP cycle, ALAP cycle> pair, e.g. <1,1>, <2,4>, <3,3>; the slack of an operation is ALAP - ASAP.)
ASAP and ALAP formulas

  asap(v) = max{ asap(u) + delay(u,v) | (u,v) ∈ E }   if pred(v) ≠ ∅
            0                                         otherwise

  alap(v) = min{ alap(u) - delay(v,u) | (v,u) ∈ E }   if succ(v) ≠ ∅
            Lmax                                      otherwise

  slack(v) = alap(v) - asap(v)
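A quick check of the formulas with illustrative latencies: for the chain LD -(2)-> ADD -(1)-> ST and Lmax = 3, asap = (0, 2, 3) and alap = (0, 2, 3), so all slacks are 0: the chain is the critical path. An independent operation in the same block would get asap = 0, alap = 3, slack = 3, i.e. maximal scheduling freedom.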
Cycle-based list scheduling

proc Schedule(DDG = (V,E))
begin
    ready  = { v | ¬∃(u,v) ∈ E }   // all nodes without a predecessor
    ready’ = ready                  // all nodes schedulable in the current cycle
    sched  = ∅
    current_cycle = 0
    while sched ≠ V do
        for each v ∈ ready’ do
            if ¬ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched ∪ {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
        ready’ = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
endproc
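A runnable C sketch of this procedure (a toy setup of my own: a single resource type with UNITS units per cycle and a hardcoded 5-operation DDG; a production scheduler would also order ready’ by a priority such as slack):

  #include <stdio.h>

  #define N     5   /* operations */
  #define UNITS 2   /* resource units available per cycle (assumed) */

  /* Toy DDG: edge[u][v] = delay(u,v), 0 means no edge.
     0:ld  1:ld  2:add(0,1)  3:mul(2)  4:st(3); loads take 2 cycles. */
  int edge[N][N] = {
      [0][2] = 2, [1][2] = 2,
      [2][3] = 1, [3][4] = 2,
  };

  int main(void) {
      int cycle[N], scheduled[N] = {0}, done = 0, cc = 0;
      while (done < N) {                        /* while sched != V */
          int used = 0;                         /* units used in this cycle */
          for (int v = 0; v < N; v++) {
              if (scheduled[v]) continue;
              int ready = 1;                    /* predecessors done, results arrived? */
              for (int u = 0; u < N; u++)
                  if (edge[u][v] && (!scheduled[u] || cycle[u] + edge[u][v] > cc))
                      ready = 0;
              if (ready && used < UNITS) {      /* no resource conflict */
                  cycle[v] = cc;
                  scheduled[v] = 1;
                  used++; done++;
              }
          }
          cc++;                                 /* current_cycle + 1 */
      }
      for (int v = 0; v < N; v++)
          printf("op %d -> cycle %d\n", v, cycle[v]);
      return 0;
  }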
Problem with basic block scheduling
• Basic blocks contain on average only about 6 instructions
• Unrolling may help for loops
• Go beyond basic blocks:
  1. Extended basic block scheduling
  2. Software pipelining
Extended basic block scheduling: Scope

Partitioning a CFG into scheduling scopes:

(Figure: a CFG with blocks A, B, C, D, E, F, G. Left: a trace is selected through the CFG. Right: a superblock formed from the same blocks; tail duplication (E’, D’, G’) removes the side entries into the trace.)
Extended basic block scheduling: Scope

Partitioning a CFG into scheduling scopes (cont’d):

(Figure: the same CFG partitioned as a hyperblock/region, and as a decision tree; the decision tree again requires tail duplication (E’, F’, D’, G’, G’’).)
Extended basic block scheduling: Scope

Comparing scheduling scopes:

                            Trace   Sup.block   Hyp.block   Dec.Tree   Region
  Multiple exec. paths       No      No          Yes         Yes        Yes
  Side-entries allowed       Yes     No          No          No         No
  Join points allowed        Yes     No          Yes         No         Yes
  Code motion down joins     Yes     No          No          No         No
  Must be if-convertible     No      No          Yes         No         No
  Tail dup. before sched.    No      Yes         No          Yes        No
Extended basic block scheduling: Code Motion

Example CFG: block A contains a) add r4,r4,4 and b) beq ...; A branches to B, with c) add r1,r1,r2, and to C, with d) sub r1,r1,r2; B and C join in D, which contains e) st r1,8(r4).

• Downward code motions?
— a → B, a → C, a → D, c → D, d → D
• Upward code motions?
— c → A, d → A, e → B, e → C, e → A
Extended basic block scheduling: Code Motion

(Figure: moving an operation b upward from its source block to a destination block; duplicates b’ are placed where needed.)

Legend:
• basic blocks between the source and destination basic blocks
• control flow edges where off-liveness checks have to be performed
• basic blocks where duplicates (b’) have to be placed
• destination basic blocks
• source basic blocks

• SCP (single copy on a path) rule: no path may exist between two different duplication blocks
Extended basic block scheduling: Code Motion

• A dominates B ⇔ A is always executed before B
– Consequently: if A does not dominate B, code motion from B to A requires code duplication
• B post-dominates A ⇔ B is always executed after A
– Consequently: if B does not post-dominate A, code motion from B to A is speculative

(Figure: example CFG with blocks A; B, C; D, E; F.)

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
Scheduling: Loops

Loop optimizations:

(Figure: a loop A → B, C → D, where C is the loop body. Loop peeling places a copy C’ of the first iteration(s) in front of the loop; loop unrolling places copies C’, C’’ of the body inside the loop.)
Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

(Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining.)
Software pipelining

• Software pipelining a loop means:
– Scheduling the loop such that iterations start before preceding iterations have finished
Or:
– Moving operations across the backedge

Example: y = a.x, with a 3-operation loop body (LD, ML, ST):

Basic block scheduling (3 cycles/iteration):
  LD
  ML
  ST

Unrolling, 3 iterations (5/3 cycles/iteration):
  LD
  LD ML
  LD ML ST
     ML ST
        ST

Software pipelining (1 cycle/iteration in steady state):
  LD
  LD ML
  LD ML ST   <- kernel: the LD, ML and ST of three consecutive iterations each cycle
  LD ML ST
     ML ST
        ST
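A C sketch of the same idea (my rendering, assuming the loop is y[i] = a * x[i]; the kernel overlaps the LD of iteration i+2, the ML of iteration i+1 and the ST of iteration i):

  /* Software-pipelined version of: for (i = 0; i < n; i++) y[i] = a * x[i]; */
  void scale(float *y, const float *x, float a, int n) {
      if (n < 2) {                       /* too short to pipeline */
          for (int i = 0; i < n; i++) y[i] = a * x[i];
          return;
      }
      float ml = a * x[0];               /* prologue: LD 0, ML 0 */
      float ld = x[1];                   /* prologue: LD 1       */
      for (int i = 0; i + 2 < n; i++) {  /* kernel, 1 iteration per step */
          y[i] = ml;                     /* ST i   */
          ml = a * ld;                   /* ML i+1 */
          ld = x[i + 2];                 /* LD i+2 */
      }
      y[n - 2] = ml;                     /* epilogue: drain the pipeline */
      y[n - 1] = a * ld;
  }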
Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

(a) Example loop:
  for (i = 0; i < n; i++)
      a[i+6] = 3*a[i] - 1;

(b) Loop body without loop control:
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

(c) Software pipeline, successive iterations overlapped (one column per iteration):

  ld  r1,(r2)                                             \
  mul r3,r1,3   ld  r1,(r2)                                } Prologue
  sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)                 /
  st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)   } Kernel
                st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   \
                              st  r4,(r5)   sub r4,r3,1    } Epilogue
                                            st  r4,(r5)   /

• The prologue fills the SW pipeline with iterations
• The epilogue drains the SW pipeline
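Note the distance-6 dependence: a[i+6], written in iteration i, is read back six iterations later. For the initiation interval II this recurrence only requires II ≥ ⌈L/6⌉, where L is the total latency around the ld–mul–sub–st cycle; with an illustrative L = 6 cycles, II = 1 remains feasible.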
Summary and Conclusions

• Compilation for ILP architectures is getting mature and is entering the commercial arena.
• However:
– There is a great discrepancy between the available and the exploitable parallelism.

What if you need more parallelism?
– apply source-to-source transformations
– use other algorithms