Compiler Optimisation - 6 Instruction Scheduling

Page 1: Compiler Optimisation - 6 Instruction Scheduling

Compiler Optimisation
6 – Instruction Scheduling

Hugh Leather
IF 1.18a
[email protected]

Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh

2019

Page 2: Compiler Optimisation - 6 Instruction Scheduling

Introduction

This lecture:
• Scheduling to hide latency and exploit ILP
• Dependence graph
• Local list scheduling + priorities
• Forward versus backward scheduling
• Software pipelining of loops

Page 3: Compiler Optimisation - 6 Instruction Scheduling

Latency, functional units, and ILP

Instructions take clock cycles to execute (latency)
Modern machines issue several operations per cycle
Results cannot be used until ready, but other work can be done meanwhile
Execution time is order-dependent
Latencies are not always constant (cache misses, early exit, etc.)

Operation            Cycles
load, store          3
load (cache miss)    100s
loadI, add, shift    1
mult                 2
div                  40
branch               0 – 8

Page 4: Compiler Optimisation - 6 Instruction Scheduling

Machine types

In order
Deep pipelining allows multiple instructions

Superscalar
Multiple functional units, can issue > 1 instruction

Out of order
Large window of instructions can be reordered dynamically

VLIW
Compiler statically allocates to FUs

Pages 5–12: Compiler Optimisation - 6 Instruction Scheduling

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Simple schedule for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations               Operands waiting
1      loadAI rarp, @a ⇒ r1     r1
2                               r1
3                               r1
4      add r1, r1 ⇒ r1          r1
5      loadAI rarp, @b ⇒ r2     r2
6                               r2
7                               r2
8      mult r1, r2 ⇒ r1         r1
9      loadAI rarp, @c ⇒ r2     r1, r2
10                              r2
11                              r2
12     mult r1, r2 ⇒ r1         r1
13                              r1
14     storeAI r1 ⇒ rarp, @a    store to complete
15                              store to complete
16                              store to complete
       Done

(The load at cycle 9 does not use r1, so it can issue while r1 is still being computed.)

Pages 13–20: Compiler Optimisation - 6 Instruction Scheduling

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Schedule loads early for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations               Operands waiting
1      loadAI rarp, @a ⇒ r1     r1
2      loadAI rarp, @b ⇒ r2     r1, r2
3      loadAI rarp, @c ⇒ r3     r1, r2, r3
4      add r1, r1 ⇒ r1          r1, r2, r3
5      mult r1, r2 ⇒ r1         r1, r3
6                               r1
7      mult r1, r3 ⇒ r1         r1
8                               r1
9      storeAI r1 ⇒ rarp, @a    store to complete
10                              store to complete
11                              store to complete
       Done

Uses one more register

11 versus 16 cycles – 31% faster!
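To make the latency arithmetic concrete, here is a minimal Python sketch (my own illustration, not from the slides) that replays both schedules on the slide's machine model: in order, single issue, a new op each cycle once its operands are ready, with the stated latencies. The helper name sim and the op tuples are assumptions for this sketch.

    # Sketch of the slide's machine model: in-order, single issue. An op issues
    # one cycle after the previous op, but waits until its source registers are
    # ready. Latencies follow the footnote: loads/stores 3, mults 2, adds 1.
    LAT = {"load": 3, "store": 3, "mult": 2, "add": 1}

    def sim(ops):
        """ops: list of (kind, sources, destination-or-None). Returns last cycle used."""
        ready = {}        # register -> first cycle in which its value is usable
        cycle = 0         # cycle in which the previous op issued
        finish = 0
        for kind, srcs, dst in ops:
            issue = max([cycle + 1] + [ready.get(r, 1) for r in srcs])
            if dst is not None:
                ready[dst] = issue + LAT[kind]
            cycle = issue
            finish = max(finish, issue + LAT[kind] - 1)
        return finish

    simple = [("load", [], "r1"), ("add", ["r1", "r1"], "r1"),
              ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
              ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
              ("store", ["r1"], None)]
    loads_early = [("load", [], "r1"), ("load", [], "r2"), ("load", [], "r3"),
                   ("add", ["r1", "r1"], "r1"), ("mult", ["r1", "r2"], "r1"),
                   ("mult", ["r1", "r3"], "r1"), ("store", ["r1"], None)]
    print(sim(simple), sim(loads_early))   # 16 and 11 cycles, as in the tables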

Page 21: Compiler Optimisation - 6 Instruction Scheduling

Scheduling problem

A schedule maps operations to cycles: ∀ a ∈ Ops, S(a) ∈ ℕ
Respect latency: ∀ a, b ∈ Ops, a depends on b ⟹ S(a) ≥ S(b) + λ(b)
Respect functional units: no more ops of each type per cycle than the FUs can handle

Length of schedule: L(S) = max over a ∈ Ops of (S(a) + λ(a))
Schedule S is time-optimal if ∀ S1, L(S) ≤ L(S1)

Problem: Find a time-optimal schedule³

Even local scheduling with many restrictions is NP-complete

³A schedule might also be optimal in terms of registers, power, or space
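As a concrete reading of these constraints, here is a small Python sketch (my own, not from the slides) that checks whether a candidate schedule S respects latency and the functional-unit limits, and computes L(S). The dictionary names (deps, lat, fu, fu_count) are assumptions for this sketch.

    from collections import Counter

    def is_valid(S, deps, lat, fu, fu_count):
        # Respect latency: S(a) >= S(b) + λ(b) whenever a depends on b
        for a, preds in deps.items():
            for b in preds:
                if S[a] < S[b] + lat[b]:
                    return False
        # Respect functional units: ops of each type per cycle <= FUs of that type
        per_cycle = Counter((S[a], fu[a]) for a in S)
        return all(n <= fu_count[t] for (_, t), n in per_cycle.items())

    def length(S, lat):
        # L(S) = max over ops of S(a) + λ(a)
        return max(S[a] + lat[a] for a in S)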

Page 22: Compiler Optimisation - 6 Instruction Scheduling

List scheduling

Local greedy heuristic to produce schedules for single basic blocks
1. Rename to avoid anti-dependences
2. Build dependency graph
3. Prioritise operations
4. For each cycle:
   1. Choose the highest priority ready operation and schedule it
   2. Update the ready queue

Pages 23–25: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Dependence/Precedence graph

Schedule an operation only when its operands are ready
Build a dependency graph of read-after-write (RAW) dependences
Label nodes with latency and FU requirements
Anti-dependences (WAR) restrict movement – renaming removes them

Example: a = 2*a*b*c
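A minimal sketch (my own, not from the slides) of the RAW dependence graph for this example after renaming, labelling each op with its latency. The op names and the tuple layout are assumptions; the sequence mirrors the ILOC-like code used earlier.

    # RAW (read-after-write) dependence graph for a := 2*a*b*c after renaming.
    ops = {                                    # name: (ops it depends on, latency)
        "load_a": ([], 3),                     # loadAI rarp, @a ⇒ r1
        "load_b": ([], 3),                     # loadAI rarp, @b ⇒ r2
        "load_c": ([], 3),                     # loadAI rarp, @c ⇒ r3
        "add":    (["load_a"], 1),             # add r1, r1 ⇒ r1   (2*a)
        "mult1":  (["add", "load_b"], 2),      # mult r1, r2 ⇒ r1
        "mult2":  (["mult1", "load_c"], 2),    # mult r1, r3 ⇒ r1
        "store":  (["mult2"], 3),              # storeAI r1 ⇒ rarp, @a
    }
    # Edges run from each op to the ops whose results it reads.
    edges = [(op, src) for op, (srcs, _) in ops.items() for src in srcs]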

Page 26: Compiler Optimisation - 6 Instruction Scheduling

List scheduling

List scheduling algorithm
    Cycle ← 1
    Ready ← leaves of D
    Active ← ∅
    while (Ready ∪ Active ≠ ∅)
        ∀ a ∈ Active where S(a) + λ(a) ≤ Cycle
            Active ← Active - a
            ∀ b ∈ succs(a) where isready(b)
                Ready ← Ready ∪ b
        if ∃ a ∈ Ready where ∀ b ∈ Ready, priority(a) ≥ priority(b)
            Ready ← Ready - a
            S(a) ← Cycle
            Active ← Active ∪ a
        Cycle ← Cycle + 1
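A runnable Python sketch of this algorithm (my own transcription; it assumes one op issues per cycle, and the names graph, lat and priority are assumptions). The graph maps each op to the ops it depends on, matching the dependence-graph sketch above; priority is a dict such as the critical-path priorities on the next slide.

    def list_schedule(graph, lat, priority):
        preds = {a: set(bs) for a, bs in graph.items()}
        succs = {a: set() for a in graph}
        for a, bs in graph.items():
            for b in bs:
                succs[b].add(a)

        S = {}
        cycle = 1
        ready = {a for a, bs in preds.items() if not bs}   # leaves of D
        active = set()
        while ready or active:
            # Retire ops whose results are now available
            for a in [x for x in active if S[x] + lat[x] <= cycle]:
                active.remove(a)
                for b in succs[a]:
                    # b is ready once all of its operands have completed
                    if b not in S and all(c in S and S[c] + lat[c] <= cycle
                                          for c in preds[b]):
                        ready.add(b)
            if ready:
                a = max(ready, key=lambda x: priority[x])  # highest-priority ready op
                ready.remove(a)
                S[a] = cycle
                active.add(a)
            cycle += 1
        return S

With the dependence graph sketched earlier and critical-path priorities, this issues the three loads in cycles 1–3 and the store in cycle 9, matching the 11-cycle schedule shown earlier.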

Page 27: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Priorities

Many different priorities are used
The quality of schedules depends on a good choice

The longest-latency path, or critical path, is a good priority
Tie breakers:
• Last use of a value - decreases demand for registers by moving uses nearer their definitions
• Number of descendants - encourages the scheduler to pursue multiple paths
• Longer latency first - others can fit in its shadow
• Random
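One possible way to compute the critical-path priority on the same graph representation (a sketch of my own; critical_path_priority is not a name from the slides):

    from functools import lru_cache

    def critical_path_priority(graph, lat):
        # graph: op -> ops it depends on; lat: op -> latency.
        succs = {a: [] for a in graph}
        for a, preds in graph.items():
            for b in preds:
                succs[b].append(a)

        @lru_cache(maxsize=None)
        def cp(a):
            # An op's critical path is its own latency plus the longest
            # critical path among the ops that consume its result.
            return lat[a] + max((cp(b) for b in succs[a]), default=0)

        return {a: cp(a) for a in graph}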

Pages 28–39: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Example: Schedule with priority by critical path length

Page 40: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Can schedule from root to leaves (backward)
May change schedule time
List scheduling is cheap, so try both and choose the best

Page 41: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Opcode    loadI  lshift  add  addI  cmp  store
Latency     1      1      2    1     1     4

Page 42: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Forwards
Cycle  Int      Int      Stores
1      loadI1   lshift
2      loadI2   loadI3
3      loadI4   add1
4      add2     add3
5      add4     addI     store1
6      cmp               store2
7                        store3
8                        store4
9                        store5
10
11
12
13     cbr

Backwards
Cycle  Int      Int      Stores
1      loadI1
2      addI     lshift
3      add4     loadI3
4      add3     loadI2   store5
5      add2     loadI1   store4
6      add1              store3
7                        store2
8                        store1
9
10
11     cmp
12     cbr

Page 43: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions

Schedule extended basic blocks (EBBs)
Superblock cloning
Schedule traces
Software pipelining

Pages 44–45: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Extended basic blocks

Extended basic block
An EBB is a maximal set of blocks such that:
• the set has a single entry, Bi
• each block Bj other than Bi has exactly one predecessor
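A small Python sketch (my own illustration, not from the slides) of how a CFG can be partitioned into EBBs from this definition: each join point or the entry starts a new EBB, and blocks with exactly one predecessor join their predecessor's EBB. The names cfg, entry and find_ebbs are assumptions.

    def find_ebbs(cfg, entry):
        # cfg: block -> list of successor blocks
        preds = {b: [] for b in cfg}
        for b, succs in cfg.items():
            for s in succs:
                preds[s].append(b)
        # A block starts a new EBB if it is the entry or has multiple predecessors.
        roots = [b for b in cfg if b == entry or len(preds[b]) != 1]
        ebbs = {}
        for r in roots:
            ebb, work = set(), [r]
            while work:
                b = work.pop()
                ebb.add(b)
                # Extend along successors that have exactly one predecessor.
                work.extend(s for s in cfg[b]
                            if s != entry and len(preds[s]) == 1 and s not in ebb)
            ebbs[r] = ebb
        return ebbs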

Pages 46–52: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Extended basic blocks

Schedule entire paths through EBBs
The example has four EBB paths
Having B1 in both paths causes conflicts
Moving an op out of B1 causes problems; compensation code must be inserted
Moving an op into B1 also causes problems

Pages 53–56: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Superblock cloning

Join points create context problems
Clone blocks to create more context
Merge any simple control flow
Schedule the resulting EBBs

Pages 57–60: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Trace scheduling

Use edge frequencies from a profile (not block frequencies)
Pick the “hot” path
Schedule it, inserting compensation code
Remove it from the CFG
Repeat

Pages 61–62: Compiler Optimisation - 6 Instruction Scheduling

Loop scheduling

Loop structures can dominate execution time
Specialist technique: software pipelining
Allows application of list scheduling to loops

Why not loop unrolling?
• Allows the loop overhead to become arbitrarily small, but
• causes code growth, cache pressure, and register pressure

Page 63: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining

Consider a simple loop to sum an array
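The loop itself is shown only as a figure on the original slide; a plausible reading, consistent with the a = A[i]; b = b + a; i = i + 1 body given on the initiation-interval slide later, is this Python sketch (an assumption, not the slide's exact code):

    def sum_array(A, n):
        b = 0
        i = 0
        while i < n:
            a = A[i]      # load   (3 cycles in the slide's model)
            b = b + a     # add    (1 cycle)
            i = i + 1     # add    (1 cycle)
            # conditional branch back to the top of the loop (1 cycle)
        return b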

Page 64: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Schedule on 1 FU - 5 cycles

load 3 cycles, add 1 cycle, branch 1 cycle

Page 65: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Schedule on VLIW 3 FUs - 4 cycles

load 3 cycles, add 1 cycle, branch 1 cycle

Page 66: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
A better steady state schedule exists

load 3 cycles, add 1 cycle, branch 1 cycle

Page 67: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Requires prologue and epilogue (may schedule others in epilogue)

load 3 cycles, add 1 cycle, branch 1 cycle

Page 68: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Respect dependences and latency – including loop carries

load 3 cycles, add 1 cycle, branch 1 cycle

Page 69: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Complete code

load 3 cycles, add 1 cycle, branch 1 cycle

Page 70: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Some definitions

Initiation interval (ii)
Number of cycles between initiating loop iterations
• The original loop had an ii of 5 cycles
• The final loop had an ii of 2 cycles

Recurrence
A loop-based computation whose value is used in a later loop iteration
• Might be several iterations later
• Has dependency chain(s) on itself
• The recurrence latency is the latency of that dependency chain

Page 71: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Algorithm

1. Choose an initiation interval, ii
   • Compute lower bounds on ii
   • A shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   • Try to schedule into ii cycles, using a modulo scheduler
   • If it fails, increase ii by one and try again
3. Generate the needed prologue and epilogue code
   • For the prologue, work backward from upward-exposed uses in the scheduled loop body
   • For the epilogue, work forward from downward-exposed definitions in the scheduled loop body

Page 72: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Resource constraint
• ii must be large enough to issue every operation
• Let Nu = number of FUs of type u
• Let Iu = number of operations of type u
• ⌈Iu/Nu⌉ is a lower bound on ii for type u
• maxu(⌈Iu/Nu⌉) is a lower bound on ii

Page 73: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Recurrence constraint
• ii cannot be smaller than the longest recurrence latency
• Recurrence r spans kr iterations with latency λr
• ⌈λr/kr⌉ is a lower bound on ii for recurrence r
• maxr(⌈λr/kr⌉) is a lower bound on ii

Page 74: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Start value = max( maxu(⌈Iu/Nu⌉), maxr(⌈λr/kr⌉) )

For the simple loop:

    a = A[i]
    b = b + a
    i = i + 1
    if i < n goto end

Resource constraint
            Memory  Integer  Branch
Iu            1       2        1
Nu            1       1        1
⌈Iu/Nu⌉       1       2        1

Recurrence constraint
            b   i
kr          1   1
λr          2   1
⌈λr/kr⌉     2   1
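A small Python sketch (my own) of this start-value computation for the simple loop, using the counts from the tables above; the variable names are assumptions.

    from math import ceil

    # Resource constraint: operations and FUs per type (from the table above).
    I = {"memory": 1, "integer": 2, "branch": 1}   # ops needing each FU type
    N = {"memory": 1, "integer": 1, "branch": 1}   # FUs available of each type
    resource_bound = max(ceil(I[u] / N[u]) for u in I)                       # = 2

    # Recurrence constraint: each recurrence spans k_r iterations, latency λ_r.
    recurrences = {"b": (1, 2), "i": (1, 1)}       # name -> (k_r, λ_r)
    recurrence_bound = max(ceil(lam / k) for k, lam in recurrences.values()) # = 2

    ii_start = max(resource_bound, recurrence_bound)                         # = 2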

Pages 75–79: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Modulo scheduling

Modulo scheduling
Schedule with cycle modulo the initiation interval
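A minimal sketch of the modulo-scheduling idea (my own simplification, not the slides' algorithm): each op is placed at the earliest cycle allowed by its intra-iteration dependences whose slot cycle mod ii still has a free FU of the right type, so the slot is reusable by every later iteration. The names and the fixed search limit are assumptions; a full scheduler would also check loop-carried dependences and backtrack.

    def modulo_schedule(ops, deps, lat, fu, fu_count, ii, max_cycles=1000):
        """ops: list in priority order; deps: op -> predecessor ops;
        fu: op -> FU type; fu_count: FU type -> number available.
        Returns op -> cycle, or None if the body does not fit in this ii."""
        used = {}   # (cycle % ii, FU type) -> ops already placed in that slot
        S = {}
        for a in ops:
            earliest = max((S[b] + lat[b] for b in deps[a]), default=0)
            for c in range(earliest, max_cycles):
                slot = (c % ii, fu[a])
                if used.get(slot, 0) < fu_count[fu[a]]:
                    used[slot] = used.get(slot, 0) + 1
                    S[a] = c
                    break
            else:
                return None     # failed: caller increases ii by one and retries
        return S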

Page 80: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Current research

Much research into different software pipelining techniques
Difficult when there is general control flow in the loop
Predication (in IA64, for example) really helps here
Some recent work on exhaustive scheduling, i.e. solving the NP-complete problem for basic blocks

Page 81: Compiler Optimisation - 6 Instruction Scheduling

Summary

Scheduling to hide latency and exploit ILP
Dependence graph - dependences between instructions, plus latency
Local list scheduling + priorities
Forward versus backward scheduling
Scheduling EBBs, superblock cloning, trace scheduling
Software pipelining of loops

Page 82: Compiler Optimisation - 6 Instruction Scheduling

PPar CDT Advert

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:

pervasiveparallelism.inf.ed.ac.uk

• 4-year programme: MSc by Research + PhD

• Collaboration between:
  ▶ University of Edinburgh’s School of Informatics
    ✴ Ranked top in the UK by 2014 REF
  ▶ Edinburgh Parallel Computing Centre
    ✴ UK’s largest supercomputing centre

• Full funding available

• Industrial engagement programme includes internships at leading companies

• Research-focused: Work on your thesis topic from the start

• Research topics in software, hardware, theory and application of:
  ▶ Parallelism
  ▶ Concurrency
  ▶ Distribution