Compiler Optimisation - 6 Instruction Scheduling

Page 1: Compiler Optimisation - 6 Instruction Scheduling

Compiler Optimisation
6 – Instruction Scheduling

Hugh Leather
IF 1.18a
[email protected]

Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh

2019

Page 2: Compiler Optimisation - 6 Instruction Scheduling

Introduction

This lecture:
• Scheduling to hide latency and exploit ILP
• Dependence graph
• Local list scheduling + priorities
• Forward versus backward scheduling
• Software pipelining of loops

Page 3: Compiler Optimisation - 6 Instruction Scheduling

Latency, functional units, and ILP

Instructions take clock cycles to execute (latency)
Modern machines issue several operations per cycle
Results cannot be used until ready, but other work can be done meanwhile
Execution time is order-dependent
Latencies are not always constant (cache misses, early exit, etc.)

Operation            Cycles
load, store          3
load (cache miss)    100s
loadI, add, shift    1
mult                 2
div                  40
branch               0 – 8

Page 4: Compiler Optimisation - 6 Instruction Scheduling

Machine types

In order
Deep pipelining allows multiple instructions

Superscalar
Multiple functional units, can issue > 1 instruction

Out of order
Large window of instructions can be reordered dynamically

VLIW
Compiler statically allocates to FUs

Pages 5–12: Compiler Optimisation - 6 Instruction Scheduling

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Simple schedule for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations               Operands waiting
1      loadAI rarp, @a ⇒ r1     r1
2                               r1
3                               r1
4      add r1, r1 ⇒ r1          r1
5      loadAI rarp, @b ⇒ r2     r2
6                               r2
7                               r2
8      mult r1, r2 ⇒ r1         r1
9      loadAI rarp, @c ⇒ r2     r1, r2
10                              r2
11                              r2
12     mult r1, r2 ⇒ r1         r1
13                              r1
14     storeAI r1 ⇒ rarp, @a    store to complete
15                              store to complete
16                              store to complete
       Done

(The load at cycle 9 does not use r1, so it can issue while r1 is still being computed.)

Pages 13–20: Compiler Optimisation - 6 Instruction Scheduling

Effect of scheduling
Superscalar, 1 FU: new op each cycle if operands ready

Schedule loads early for a := 2*a*b*c (loads/stores 3 cycles, mults 2, adds 1):

Cycle  Operations               Operands waiting
1      loadAI rarp, @a ⇒ r1     r1
2      loadAI rarp, @b ⇒ r2     r1, r2
3      loadAI rarp, @c ⇒ r3     r1, r2, r3
4      add r1, r1 ⇒ r1          r1, r2, r3
5      mult r1, r2 ⇒ r1         r1, r3
6                               r1
7      mult r1, r3 ⇒ r1         r1
8                               r1
9      storeAI r1 ⇒ rarp, @a    store to complete
10                              store to complete
11                              store to complete
       Done

Uses one more register

11 versus 16 cycles – 31% faster!
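To make the latency arithmetic concrete, here is a minimal Python sketch (my own illustration, not from the slides) that replays both schedules on the slide's machine model: in order, single issue, a new op each cycle once its operands are ready, with the stated latencies. The helper name sim and the op tuples are assumptions for this sketch.

    # Sketch of the slide's machine model: in-order, single issue. An op issues
    # one cycle after the previous op, but waits until its source registers are
    # ready. Latencies follow the footnote: loads/stores 3, mults 2, adds 1.
    LAT = {"load": 3, "store": 3, "mult": 2, "add": 1}

    def sim(ops):
        """ops: list of (kind, sources, destination-or-None). Returns last cycle used."""
        ready = {}        # register -> first cycle in which its value is usable
        cycle = 0         # cycle in which the previous op issued
        finish = 0
        for kind, srcs, dst in ops:
            issue = max([cycle + 1] + [ready.get(r, 1) for r in srcs])
            if dst is not None:
                ready[dst] = issue + LAT[kind]
            cycle = issue
            finish = max(finish, issue + LAT[kind] - 1)
        return finish

    simple = [("load", [], "r1"), ("add", ["r1", "r1"], "r1"),
              ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
              ("load", [], "r2"), ("mult", ["r1", "r2"], "r1"),
              ("store", ["r1"], None)]
    loads_early = [("load", [], "r1"), ("load", [], "r2"), ("load", [], "r3"),
                   ("add", ["r1", "r1"], "r1"), ("mult", ["r1", "r2"], "r1"),
                   ("mult", ["r1", "r3"], "r1"), ("store", ["r1"], None)]
    print(sim(simple), sim(loads_early))   # 16 and 11 cycles, as in the tables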

Page 21: Compiler Optimisation - 6 Instruction Scheduling

Scheduling problem

A schedule maps operations to cycles: ∀ a ∈ Ops, S(a) ∈ ℕ
Respect latency: ∀ a, b ∈ Ops, a depends on b ⟹ S(a) ≥ S(b) + λ(b)
Respect functional units: no more ops of each type per cycle than the FUs can handle

Length of schedule: L(S) = max over a ∈ Ops of (S(a) + λ(a))
Schedule S is time-optimal if ∀ S1, L(S) ≤ L(S1)

Problem: Find a time-optimal schedule³

Even local scheduling with many restrictions is NP-complete

³A schedule might also be optimal in terms of registers, power, or space
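As a concrete reading of these constraints, here is a small Python sketch (my own, not from the slides) that checks whether a candidate schedule S respects latency and the functional-unit limits, and computes L(S). The dictionary names (deps, lat, fu, fu_count) are assumptions for this sketch.

    from collections import Counter

    def is_valid(S, deps, lat, fu, fu_count):
        # Respect latency: S(a) >= S(b) + λ(b) whenever a depends on b
        for a, preds in deps.items():
            for b in preds:
                if S[a] < S[b] + lat[b]:
                    return False
        # Respect functional units: ops of each type per cycle <= FUs of that type
        per_cycle = Counter((S[a], fu[a]) for a in S)
        return all(n <= fu_count[t] for (_, t), n in per_cycle.items())

    def length(S, lat):
        # L(S) = max over ops of S(a) + λ(a)
        return max(S[a] + lat[a] for a in S)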

Page 22: Compiler Optimisation - 6 Instruction Scheduling

List scheduling

Local greedy heuristic to produce schedules for single basic blocks
1. Rename to avoid anti-dependences
2. Build dependency graph
3. Prioritise operations
4. For each cycle:
   1. Choose the highest priority ready operation and schedule it
   2. Update the ready queue

Pages 23–25: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Dependence/Precedence graph

Schedule an operation only when its operands are ready
Build a dependency graph of read-after-write (RAW) dependences
Label nodes with latency and FU requirements
Anti-dependences (WAR) restrict movement – renaming removes them

Example: a = 2*a*b*c
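A minimal sketch (my own, not from the slides) of the RAW dependence graph for this example after renaming, labelling each op with its latency. The op names and the tuple layout are assumptions; the sequence mirrors the ILOC-like code used earlier.

    # RAW (read-after-write) dependence graph for a := 2*a*b*c after renaming.
    ops = {                                    # name: (ops it depends on, latency)
        "load_a": ([], 3),                     # loadAI rarp, @a ⇒ r1
        "load_b": ([], 3),                     # loadAI rarp, @b ⇒ r2
        "load_c": ([], 3),                     # loadAI rarp, @c ⇒ r3
        "add":    (["load_a"], 1),             # add r1, r1 ⇒ r1   (2*a)
        "mult1":  (["add", "load_b"], 2),      # mult r1, r2 ⇒ r1
        "mult2":  (["mult1", "load_c"], 2),    # mult r1, r3 ⇒ r1
        "store":  (["mult2"], 3),              # storeAI r1 ⇒ rarp, @a
    }
    # Edges run from each op to the ops whose results it reads.
    edges = [(op, src) for op, (srcs, _) in ops.items() for src in srcs]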

Page 26: Compiler Optimisation - 6 Instruction Scheduling

List scheduling

List scheduling algorithm
    Cycle ← 1
    Ready ← leaves of D
    Active ← ∅
    while (Ready ∪ Active ≠ ∅)
        ∀ a ∈ Active where S(a) + λ(a) ≤ Cycle
            Active ← Active - a
            ∀ b ∈ succs(a) where isready(b)
                Ready ← Ready ∪ b
        if ∃ a ∈ Ready where ∀ b ∈ Ready, priority(a) ≥ priority(b)
            Ready ← Ready - a
            S(a) ← Cycle
            Active ← Active ∪ a
        Cycle ← Cycle + 1
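A runnable Python sketch of this algorithm (my own transcription; it assumes one op issues per cycle, and the names graph, lat and priority are assumptions). The graph maps each op to the ops it depends on, matching the dependence-graph sketch above; priority is a dict such as the critical-path priorities on the next slide.

    def list_schedule(graph, lat, priority):
        preds = {a: set(bs) for a, bs in graph.items()}
        succs = {a: set() for a in graph}
        for a, bs in graph.items():
            for b in bs:
                succs[b].add(a)

        S = {}
        cycle = 1
        ready = {a for a, bs in preds.items() if not bs}   # leaves of D
        active = set()
        while ready or active:
            # Retire ops whose results are now available
            for a in [x for x in active if S[x] + lat[x] <= cycle]:
                active.remove(a)
                for b in succs[a]:
                    # b is ready once all of its operands have completed
                    if b not in S and all(c in S and S[c] + lat[c] <= cycle
                                          for c in preds[b]):
                        ready.add(b)
            if ready:
                a = max(ready, key=lambda x: priority[x])  # highest-priority ready op
                ready.remove(a)
                S[a] = cycle
                active.add(a)
            cycle += 1
        return S

With the dependence graph sketched earlier and critical-path priorities, this issues the three loads in cycles 1–3 and the store in cycle 9, matching the 11-cycle schedule shown earlier.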

Page 27: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Priorities

Many different priorities are used
The quality of schedules depends on a good choice

The longest-latency path, or critical path, is a good priority
Tie breakers:
• Last use of a value - decreases demand for registers by moving uses nearer their definitions
• Number of descendants - encourages the scheduler to pursue multiple paths
• Longer latency first - others can fit in its shadow
• Random
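One possible way to compute the critical-path priority on the same graph representation (a sketch of my own; critical_path_priority is not a name from the slides):

    from functools import lru_cache

    def critical_path_priority(graph, lat):
        # graph: op -> ops it depends on; lat: op -> latency.
        succs = {a: [] for a in graph}
        for a, preds in graph.items():
            for b in preds:
                succs[b].append(a)

        @lru_cache(maxsize=None)
        def cp(a):
            # An op's critical path is its own latency plus the longest
            # critical path among the ops that consume its result.
            return lat[a] + max((cp(b) for b in succs[a]), default=0)

        return {a: cp(a) for a in graph}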

Pages 28–39: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Example: Schedule with priority by critical path length

Page 40: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Can schedule from root to leaves (backward)
May change schedule time
List scheduling is cheap, so try both and choose the best

Page 41: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Opcode    loadI  lshift  add  addI  cmp  store
Latency     1      1      2    1     1     4

Page 42: Compiler Optimisation - 6 Instruction Scheduling

List scheduling
Forward vs backward

Forwards
Cycle  Int      Int      Stores
1      loadI1   lshift
2      loadI2   loadI3
3      loadI4   add1
4      add2     add3
5      add4     addI     store1
6      cmp               store2
7                        store3
8                        store4
9                        store5
10
11
12
13     cbr

Backwards
Cycle  Int      Int      Stores
1      loadI1
2      addI     lshift
3      add4     loadI3
4      add3     loadI2   store5
5      add2     loadI1   store4
6      add1              store3
7                        store2
8                        store1
9
10
11     cmp
12     cbr

Page 43: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions

Schedule extended basic blocks (EBBs)
Superblock cloning
Schedule traces
Software pipelining

Pages 44–45: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Extended basic blocks

Extended basic block
An EBB is a maximal set of blocks such that:
• the set has a single entry, Bi
• each block Bj other than Bi has exactly one predecessor
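A small Python sketch (my own illustration, not from the slides) of how a CFG can be partitioned into EBBs from this definition: each join point or the entry starts a new EBB, and blocks with exactly one predecessor join their predecessor's EBB. The names cfg, entry and find_ebbs are assumptions.

    def find_ebbs(cfg, entry):
        # cfg: block -> list of successor blocks
        preds = {b: [] for b in cfg}
        for b, succs in cfg.items():
            for s in succs:
                preds[s].append(b)
        # A block starts a new EBB if it is the entry or has multiple predecessors.
        roots = [b for b in cfg if b == entry or len(preds[b]) != 1]
        ebbs = {}
        for r in roots:
            ebb, work = set(), [r]
            while work:
                b = work.pop()
                ebb.add(b)
                # Extend along successors that have exactly one predecessor.
                work.extend(s for s in cfg[b]
                            if s != entry and len(preds[s]) == 1 and s not in ebb)
            ebbs[r] = ebb
        return ebbs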

Pages 46–52: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Extended basic blocks

Schedule entire paths through EBBs
The example has four EBB paths
Having B1 in both paths causes conflicts
Moving an op out of B1 causes problems; compensation code must be inserted
Moving an op into B1 also causes problems

Pages 53–56: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Superblock cloning

Join points create context problems
Clone blocks to create more context
Merge any simple control flow
Schedule the resulting EBBs

Pages 57–60: Compiler Optimisation - 6 Instruction Scheduling

Scheduling Larger Regions
Trace scheduling

Use edge frequencies from a profile (not block frequencies)
Pick the “hot” path
Schedule it, inserting compensation code
Remove it from the CFG
Repeat

Pages 61–62: Compiler Optimisation - 6 Instruction Scheduling

Loop scheduling

Loop structures can dominate execution time
Specialist technique: software pipelining
Allows application of list scheduling to loops

Why not loop unrolling?
• Allows the loop overhead to become arbitrarily small, but
• causes code growth, cache pressure, and register pressure

Page 63: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining

Consider a simple loop to sum an array
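The loop itself is shown only as a figure on the original slide; a plausible reading, consistent with the a = A[i]; b = b + a; i = i + 1 body given on the initiation-interval slide later, is this Python sketch (an assumption, not the slide's exact code):

    def sum_array(A, n):
        b = 0
        i = 0
        while i < n:
            a = A[i]      # load   (3 cycles in the slide's model)
            b = b + a     # add    (1 cycle)
            i = i + 1     # add    (1 cycle)
            # conditional branch back to the top of the loop (1 cycle)
        return b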

Page 64: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Schedule on 1 FU - 5 cycles

load 3 cycles, add 1 cycle, branch 1 cycle

Page 65: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Schedule on VLIW 3 FUs - 4 cycles

load 3 cycles, add 1 cycle, branch 1 cycle

Page 66: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
A better steady state schedule exists

load 3 cycles, add 1 cycle, branch 1 cycle

Page 67: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Requires prologue and epilogue (may schedule others in epilogue)

load 3 cycles, add 1 cycle, branch 1 cycle

Page 68: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Respect dependences and latency – including loop carries

load 3 cycles, add 1 cycle, branch 1 cycle

Page 69: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Complete code

load 3 cycles, add 1 cycle, branch 1 cycle

Page 70: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Some definitions

Initiation interval (ii)
Number of cycles between initiating loop iterations
• The original loop had an ii of 5 cycles
• The final loop had an ii of 2 cycles

Recurrence
A loop-based computation whose value is used in a later loop iteration
• Might be several iterations later
• Has dependency chain(s) on itself
• The recurrence latency is the latency of that dependency chain

Page 71: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Algorithm

1. Choose an initiation interval, ii
   • Compute lower bounds on ii
   • A shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   • Try to schedule into ii cycles, using a modulo scheduler
   • If it fails, increase ii by one and try again
3. Generate the needed prologue and epilogue code
   • For the prologue, work backward from upward-exposed uses in the scheduled loop body
   • For the epilogue, work forward from downward-exposed definitions in the scheduled loop body

Page 72: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Resource constraint
• ii must be large enough to issue every operation
• Let Nu = number of FUs of type u
• Let Iu = number of operations of type u
• ⌈Iu/Nu⌉ is a lower bound on ii for type u
• maxu(⌈Iu/Nu⌉) is a lower bound on ii

Page 73: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Recurrence constraint
• ii cannot be smaller than the longest recurrence latency
• Recurrence r spans kr iterations with latency λr
• ⌈λr/kr⌉ is a lower bound on ii for recurrence r
• maxr(⌈λr/kr⌉) is a lower bound on ii

Page 74: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Initial initiation interval (ii)

Starting value for ii based on minimum resource and recurrence constraints

Start value = max( maxu(⌈Iu/Nu⌉), maxr(⌈λr/kr⌉) )

For the simple loop:

    a = A[i]
    b = b + a
    i = i + 1
    if i < n goto end

Resource constraint
            Memory  Integer  Branch
Iu            1       2        1
Nu            1       1        1
⌈Iu/Nu⌉       1       2        1

Recurrence constraint
            b   i
kr          1   1
λr          2   1
⌈λr/kr⌉     2   1
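A small Python sketch (my own) of this start-value computation for the simple loop, using the counts from the tables above; the variable names are assumptions.

    from math import ceil

    # Resource constraint: operations and FUs per type (from the table above).
    I = {"memory": 1, "integer": 2, "branch": 1}   # ops needing each FU type
    N = {"memory": 1, "integer": 1, "branch": 1}   # FUs available of each type
    resource_bound = max(ceil(I[u] / N[u]) for u in I)                       # = 2

    # Recurrence constraint: each recurrence spans k_r iterations, latency λ_r.
    recurrences = {"b": (1, 2), "i": (1, 1)}       # name -> (k_r, λ_r)
    recurrence_bound = max(ceil(lam / k) for k, lam in recurrences.values()) # = 2

    ii_start = max(resource_bound, recurrence_bound)                         # = 2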

Pages 75–79: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Modulo scheduling

Modulo scheduling
Schedule with cycle modulo the initiation interval
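A minimal sketch of the modulo-scheduling idea (my own simplification, not the slides' algorithm): each op is placed at the earliest cycle allowed by its intra-iteration dependences whose slot cycle mod ii still has a free FU of the right type, so the slot is reusable by every later iteration. The names and the fixed search limit are assumptions; a full scheduler would also check loop-carried dependences and backtrack.

    def modulo_schedule(ops, deps, lat, fu, fu_count, ii, max_cycles=1000):
        """ops: list in priority order; deps: op -> predecessor ops;
        fu: op -> FU type; fu_count: FU type -> number available.
        Returns op -> cycle, or None if the body does not fit in this ii."""
        used = {}   # (cycle % ii, FU type) -> ops already placed in that slot
        S = {}
        for a in ops:
            earliest = max((S[b] + lat[b] for b in deps[a]), default=0)
            for c in range(earliest, max_cycles):
                slot = (c % ii, fu[a])
                if used.get(slot, 0) < fu_count[fu[a]]:
                    used[slot] = used.get(slot, 0) + 1
                    S[a] = c
                    break
            else:
                return None     # failed: caller increases ii by one and retries
        return S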

Page 80: Compiler Optimisation - 6 Instruction Scheduling

Software pipelining
Current research

Much research into different software pipelining techniques
Difficult when there is general control flow in the loop
Predication (in IA64, for example) really helps here
Some recent work on exhaustive scheduling, i.e. solving the NP-complete problem for basic blocks

Page 81: Compiler Optimisation - 6 Instruction Scheduling

Summary

Scheduling to hide latency and exploit ILP
Dependence graph - dependences between instructions, plus latency
Local list scheduling + priorities
Forward versus backward scheduling
Scheduling EBBs, superblock cloning, trace scheduling
Software pipelining of loops

Page 82: Compiler Optimisation - 6 Instruction Scheduling

PPar CDT Advert

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:

pervasiveparallelism.inf.ed.ac.uk

• 4-year programme: MSc by Research + PhD

• Collaboration between:
  ▶ University of Edinburgh’s School of Informatics
    ✴ Ranked top in the UK by 2014 REF
  ▶ Edinburgh Parallel Computing Centre
    ✴ UK’s largest supercomputing centre

• Full funding available

• Industrial engagement programme includes internships at leading companies

• Research-focused: Work on your thesis topic from the start

• Research topics in software, hardware, theory and application of:
  ▶ Parallelism
  ▶ Concurrency
  ▶ Distribution