Software Pipelining on Multi-Core Architectures

Alban Douillet, Guang R. Gao

CISC 879 : Software Support for Multicore Architectures

Tom St. John

Dept of Electrical and Computer Engineering

University of Delaware

Outline

• Introduction

• Single-Dimension Software Pipelining


• Multi-Threaded Software Pipelining

• Experiments

Introduction

• Software pipelining is among the most successful optimizations

• Can it be applied to multi-core chips?

• What extensions are required?


Single-Dimension SWP

• Does not simply pipeline innermost loop

- Pipelines most profitable loop level

• Loop levels enclosing selected loop left to global scheduler


- Selected loop seen as outermost loop

- Inner loops executed sequentially

• Able to take advantage of ILP/data locality properties present in other loops

SSP Example



Multi-Threaded SWP

• Several Obstacles Exist

• Dependences/resource constraints must be respected

• Operation cannot be scheduled before all dependences are satisfied


• Memory dependences may exist between thread units

- Synchronization is required

Multi-Threaded Final Schedule

• Schedule each group of Sn iterations on a thread unit using round-robin approach

- Workload balance is fair

• Sn is max number of iterations that can be executed in parallel without resource conflict


• Thread units may share same instruction cache
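The round-robin assignment above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `thread_unit_for` and `round_robin_schedule` are hypothetical, and Sn and the number of thread units are passed in as plain integers rather than derived from a resource model.

```python
def thread_unit_for(iteration, sn, n_units):
    """Return the thread unit that runs a given outermost iteration.

    Iterations are grouped into consecutive blocks of `sn`, and the
    blocks are handed out to thread units in round-robin order.
    """
    group = iteration // sn
    return group % n_units


def round_robin_schedule(n_iterations, sn, n_units):
    """Map each thread unit to the list of iterations it executes."""
    assignment = {unit: [] for unit in range(n_units)}
    for i in range(n_iterations):
        assignment[thread_unit_for(i, sn, n_units)].append(i)
    return assignment
```

For example, `round_robin_schedule(10, 2, 3)` gives thread unit 0 the iterations 0, 1, 6, 7, thread unit 1 the iterations 2, 3, 8, 9, and thread unit 2 the iterations 4, 5.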

Final Schedule Example



Data Dependencies

• Data dependencies may exist between outermost iterations

• Synchronization points are chosen to minimize code duplication during code generation


- WAIT is placed before each repeating pattern

- SIGNAL is placed after each pattern

• Synchronization delay guarantees the correctness of the schedule

Synchronization Delay Example






Synchronization

• Each thread has two counters

- Synchronization counter counts number of synchronization signals received

- Clock counter incremented after each WAIT


• When a thread reaches a WAIT, execution continues only if the synchronization counter is greater than or equal to the clock counter

• WAIT implemented with an active loop

• SIGNAL is a non-blocking atomic add-in-memory instruction
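The two-counter protocol can be modelled with ordinary Python threads. This is a rough sketch, not the Cyclops64 implementation: the `ThreadUnit` class and `run_demo` function are hypothetical, a lock stands in for the atomic add-in-memory instruction, and the clock counter is assumed to start at 1 so that the k-th WAIT requires k signals (the slides do not state the initial values).

```python
import threading
import time


class ThreadUnit:
    """Hypothetical model of the two counters kept by one thread unit."""

    def __init__(self):
        self.sync_counter = 0    # number of synchronization signals received
        self.clock_counter = 1   # assumption: the k-th WAIT needs k signals
        self._lock = threading.Lock()  # stands in for the atomic add

    def signal(self):
        """SIGNAL: increment the counter without waiting for the receiver."""
        with self._lock:
            self.sync_counter += 1

    def wait(self):
        """WAIT: active loop until sync_counter >= clock_counter."""
        while self.sync_counter < self.clock_counter:
            time.sleep(0)  # yield so the signalling thread can make progress
        self.clock_counter += 1  # incremented after each WAIT


def run_demo(n_patterns=5):
    """One producer signals a consumer after each write to shared memory."""
    consumer_unit = ThreadUnit()
    shared = []     # memory written by the producer thread
    observed = []   # values the consumer reads after each WAIT

    def producer():
        for k in range(n_patterns):
            shared.append(k * k)     # memory access preceding the SIGNAL
            consumer_unit.signal()

    def consumer():
        for k in range(n_patterns):
            consumer_unit.wait()     # returns once the k-th write is visible
            observed.append(shared[k])

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed, consumer_unit
```

Because the consumer only proceeds when enough signals have arrived, it never reads `shared[k]` before the producer has written it, whichever thread starts first.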

Innermost Loop Tiling

• Allows for coarser-grain synchronization

• Execution of Nn - 1 instances of the innermost loop pattern is tiled into tiles of G iterations

• WAIT and SIGNAL are issued at the entrance and exit of each tile


• Gmin, value of G that minimizes final schedule length, can be approximated at compile time
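To see why tiling coarsens the synchronization, the event stream of one thread unit can be listed for a given tile size. The sketch below is illustrative only: `tiled_events` and `sync_pairs` are hypothetical names, the number of pattern instances is passed in directly, and no attempt is made to compute Gmin.

```python
import math


def tiled_events(n_instances, g):
    """List the events a thread unit issues when the innermost loop
    pattern instances are grouped into tiles of `g` iterations."""
    events = []
    for tile_start in range(0, n_instances, g):
        tile_end = min(tile_start + g, n_instances)
        events.append("WAIT")                      # entrance of the tile
        events.extend(f"pattern[{k}]" for k in range(tile_start, tile_end))
        events.append("SIGNAL")                    # exit of the tile
    return events


def sync_pairs(n_instances, g):
    """Number of WAIT/SIGNAL pairs issued for a given tile size."""
    return math.ceil(n_instances / g)
```

With 5 pattern instances, G = 1 issues 5 WAIT/SIGNAL pairs while G = 2 issues only 3. A larger G reduces synchronization overhead but makes the next thread unit wait longer before it can start, which is the trade-off Gmin balances.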

Cross-Iteration Register Dependences

• Assume thread units do not share registers

• Insert copy operations to copy value from one thread unit to next

• Register dependence transformed into memory dependence


• Issue memory spill instruction to copy from register to scratch-pad memory of destination thread

• Value restored using local memory load

Cross-Iteration Register Dependences

• Memory spill instructions only need to be issued by the last iteration of an iteration group

• Memory restore instructions only need to be issued by the first iteration


• If distance of dependence is greater than 1, cascading copies and memory spills/restores will bring value to target iteration
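The spill/restore scheme for a dependence of distance 1 can be illustrated with a sequential simulation. Everything here is an assumption made for the sake of the example: the function name `run_with_spills`, the one-slot-per-unit scratch-pad layout, and the running-sum loop body. Synchronization between thread units is omitted, so the groups simply run in order.

```python
def run_with_spills(values, group_size, n_units):
    """Simulate a running sum whose accumulator lives in a register.

    Within an iteration group the value stays in the register. Only the
    last iteration of a group spills it to the scratch-pad of the next
    thread unit, and only the first iteration of a group restores it.
    """
    scratchpad = [0] * n_units   # hypothetical: one slot per thread unit
    spills = restores = 0
    results = []
    for start in range(0, len(values), group_size):
        unit = (start // group_size) % n_units   # round-robin owner
        reg = scratchpad[unit]                   # restore: local memory load
        restores += 1
        for i in range(start, min(start + group_size, len(values))):
            reg = reg + values[i]                # value carried in a register
            results.append(reg)
        next_unit = (unit + 1) % n_units
        scratchpad[next_unit] = reg              # spill to destination thread
        spills += 1
    return results, spills, restores
```

Running it on seven values with groups of three and two thread units yields the same prefix sums as a plain sequential loop, using three spills and three restores instead of one per iteration.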

Cross-Iteration Dependence Example


Correctness & Properties

• The multi-core final schedule represented by the schedule function is correct

• The multi-threaded final schedule is deadlock-free

• The synchronization signal guarantees that the memory accesses preceding it on the same thread unit have been committed

Experimental Framework

• The MTS method has been implemented on the Open64 compiler retargeted for the IBM Cyclops64 architecture

• Loop nests from the Livermore Suite, SPEC2000 and NAS were used


Execution Time Speedup


• MTS schedules showed very good scalability, with relative speedup between 57.5 and 81 for 99 threads

Loop Tiling Factor


• Timing results using tiling factor Gmin match results using best empirical tiling factor